Remote File Virtualization in a Switched File System

ABSTRACT

A plurality of network file manager switches interoperate to provide remote file virtualization. Copies of file data and/or metadata are maintained at a central site and at one or more remote sites. The network file manager switch at the remote site may satisfy certain client requests locally without having to contact the network file manager switch at the central site. A global namespace is maintained and is communicated to all network file manager switches.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority from the following United States Provisional patent applications, each of which is hereby incorporated herein by reference in its entirety:

U.S. Provisional Patent Application No. 60/940,104 entitled REMOTE FILE VIRTUALIZATION filed on May 25, 2007 (Attorney Docket No. 3193/116);

U.S. Provisional Patent Application No. 60/987,161 entitled REMOTE FILE VIRTUALIZATION METADATA MIRRORING filed on Nov. 12, 2007 (Attorney Docket No. 3193/117);

U.S. Provisional Patent Application No. 60/987,165 entitled REMOTE FILE VIRTUALIZATION DATA MIRRORING filed on Nov. 12, 2007 (Attorney Docket No. 3193/118); and

U.S. Provisional Patent Application No. 60/987,170 entitled REMOTE FILE VIRTUALIZATION WITH NO EDGE SERVERS filed on Nov. 12, 2007 (Attorney Docket No. 3193/119).

FIELD OF THE INVENTION

This invention relates generally to switched file systems, and, more specifically, to remote file virtualization in a switched file system.

BACKGROUND OF THE INVENTION

In a computer network, NAS (Network Attached Storage) file servers provide file services for clients connected in a computer network using NAS protocols such as NFS or CIFS. Historically, clients and file servers have usually been located in the same geographical location and connected in a local area network (LAN). A LAN usually has high network bandwidth and low network latency.

In today's information age, however, clients and file servers are often located across a wide geographical area and communicate over a wide area network (WAN) such as the Internet. WANs usually have low network bandwidth and high network latency, compared to LANs. Furthermore, NAS protocols, particularly CIFS, are often “chatty” and require many messages between a client and a file server in order to retrieve the contents of an entire file. The chattiness of the CIFS protocol exacerbates the latency problem that often makes accessing remote files impractical and intolerable.

One common approach to accelerate remote file access across a WAN is to use a data compression technique to reduce the size or number of messages being sent across the WAN. This solution is often referred to as WAN optimization. Under WAN optimization, two optimization appliances are used, one located at the central site (i.e., near the file servers), and another located at a remote site (i.e., near the clients). The optimization appliance at the sending site does the message compression before the message is sent, and the optimization appliance located at the receiving site reconstructs the original message from the compressed message it received. The users or the applications at a remote site are completely unaware of this compression/decompression activity. As a result, the usage of WAN network bandwidth and corresponding network latency is reduced. WAN optimization is discussed in Robb, Drew; Remote Management: WAFS, WAN Optimizes or Wait?, http://www.enterprisestorageforum.com/technology/featres/article.php/3511221, Jun. 8, 2005, which is hereby incorporated herein by reference in its entirety.

Another common approach to accelerate remote file access across a WAN is to cache file data at the remote site and service (terminate) file requests at the remote site using the cached data if possible. In this way, certain client/server communications over the WAN can be avoided. Thus, if a file that was cached at the remote site is accessed by a user at the remote site, file requests for the cached file become much faster than usual because a local file access is substantially faster than a remote file access. Caching is discussed in When Opportunity Locks-Oplocks on Windows NT, The NT Insider, Vol. 3, Issue 3, Jun. 1996 | Published: 15 Jun. 96 | Modified: 26 Aug. 2002, which is hereby incorporated herein by reference in its entirety.

WAN optimization and file caching can be used alone or together and therefore are considered to be complementary solutions. Generally speaking, file caching works reasonably well for file data that does not change frequently. If a file is cached and is then updated at the central site, the users at a remote site may not be aware of the update and may end up using stale file data. Furthermore, the contents of a file must be read or pre-fetched to fill the file cache before caching can result in faster file access. In addition, file caching does not cache directory contents. Therefore, directory related operations such as lookup or enumeration will still require client/server communication over the WAN and will consequently suffer poor performance.

A traditional file system manages the storage space by providing a hierarchical namespace. The hierarchical namespace starts from the root directory, which contains files and subdirectories. Each directory may also contain files and subdirectories identifying other files or subdirectories. Data is stored in files. Every file and directory is identified by a name. The full name of a file or directory is constructed by concatenating the name of the root directory and the names of each subdirectory that finally leads to the subdirectory containing the identified file or directory, together with the name of the file or the directory.

The full name of a file thus carries with it two pieces of information: (1) the identification of the file and (2) the physical storage location where the file is stored. If the physical storage location of a file is changed (for example, moved from one partition mounted on a system to another), the identification of the file changes as well.

For ease of management, as well as for a variety of other reasons, the administrator would like to control the physical storage location of a file. For example, important files might be stored on expensive, high-performance file servers, while less important files could be stored on less expensive and less capable file servers.

Unfortunately, moving files from one server to another usually changes the full name of the files and thus, their identification, as well. This is usually a very disruptive process, since after the move users may not be able to remember the new location of their files.

SUMMARY OF THE INVENTION

In accordance with one aspect of the invention there is provided a switched file system comprising a central network file manager and at least one remote network file manager in communication with the central network file manager via a communication network, wherein the central network file manager manages reference copies of data and metadata and wherein the remote network file managers maintain mirrored copies of data and metadata for use in servicing client requests without having to communicate with the central network file manager.

In various alternative embodiments, the central network file manager and the at least one remote network file manager may maintain a common global namespace. The metadata may be mirrored from the central network file manager to the at least one remote network file manager using a lazy mirroring technique. The metadata may be mirrored, for example, in a breadth-first fashion or in a depth-first fashion.

The central network file manager may push metadata to the at least one remote network file manager. After pushing metadata to a remote network file manager, the central network file manager may verify that the metadata has not changed since being pushed and notify the remote network file manager that the metadata is valid. The central network file manager may maintain statistics regarding access patterns by remote clients and may push the metadata to the at least one remote network file manager based on the statistics.

Alternatively, a remote network file manager may pull metadata from the central network file manager. After receiving metadata from the central network file manager, the remote network file manager may request confirmation from the central network file manager that the metadata is still valid. The remote network file manager may maintain statistics regarding access patterns by clients and may pull the metadata from the central network file manager based on the statistics.

Metadata may be updated at a remote network file manager, in which case the remote network file manager may communicate the updated metadata to the central network file manager, and the central network file manager may notify the remote network file managers that the remote site metadata is unsynchronized so that the remote network file managers do not use the unsynchronized metadata.

Data may be mirrored from the central network file manager to the at least one remote network file manager using a lazy mirroring technique. When a file is updated at a remote network file manager, the remote network file manager may communicate the updated data to the central network file manager, and the central network file manager may notify the remote network file managers that the remote site data is unsynchronized so that the remote network file managers do not use the unsynchronized data. At least one of the central network file manager and the remote network file managers may maintain statistics regarding client accesses, in which case the data for such data mirroring may be selected based on the statistics.

The remote network file managers may pass oplock requests from client devices through to the central network file manager. Additionally or alternatively, the remote network file managers may handle oplock breaks and pass oplock breaks through to the client devices. The remote network file managers may flush cached contents back to the central network file manager, in which case the central network file manager may notify all remote network file managers to break file mirrors for the file.

The data and metadata may be copied from the central network file manager to the at least one remote network file manager according to a set of rules.

The remote network file manager may disallow access to mirrored copies of data and metadata when the remote network file manager is unable to communicate with the central network file manager over the communication network. Additionally or alternatively, the remote network file manager may disallow modification of mirrored copies of data and metadata when the remote network file manager is unable to communicate with the central network file manager over the communication network.

In accordance with another aspect of the invention there is provided a network file manager that operates as a client to file server nodes and as a server to client nodes and interacts with both the client nodes and the file server nodes using the standard network file protocols, wherein the network file manager implements SMB signing on communications with the file server nodes including SMB signing on messages used to pre-fetch data from the file server nodes.

In various alternative embodiments, the network file manager may further implement data compression on communications with the file server nodes.

In accordance with another aspect of the invention there is provided a WAN optimization appliance that operates as a client to file server nodes, wherein the appliance implements SMB signing on communications with the file server nodes including SMB signing on messages used to pre-fetch data from the file server nodes.

In various alternative embodiments, the appliance may implement data compression on communications with the file server nodes.

In accordance with another aspect of the invention there is provided a WAN optimization appliance comprising a broadcast service for delivering mirror break messages reliably and in priority from the central site to the remote sites.

In accordance with another aspect of the invention there is provided a WAN optimization appliance comprising a file transfer service for pre-positioning files from a central site to a number of remote sites. Additionally or alternatively, the appliance may obtain optimal fingerprints from a set of files to be pre-positioned and pre-position these fingerprints to remote devices. The appliance may obtain fingerprints from file objects in a global namespace for fingerprint preloading at remote sites.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing features and advantages of the invention will be appreciated more fully from the following further description thereof with reference to the accompanying drawings wherein:

FIG. 1 is a schematic block diagram of a switched file system in accordance with various embodiments of the invention described in the related application incorporated by reference above;

FIG. 2 is a schematic block diagram of a switched file system employing remote file virtualization in accordance with an exemplary embodiment of the present invention;

FIG. 3 depicts an oplock break sequence in accordance with an exemplary embodiment of the present invention;

FIG. 4 shows a representation of virtual partitions that are “carved” out of the namespace such that all of the namespaces contained in each virtual partition are non-overlapping and the union of all the namespaces contained in each virtual partition is the same as the entire global namespace itself, in accordance with an exemplary embodiment of the present invention;

FIG. 5 shows a representation of an exemplary Table of Partitions Transactions in accordance with an exemplary embodiment of the present invention;

FIG. 6 shows a representation of an exemplary Table of Directory Transactions or Log in accordance with an exemplary embodiment of the present invention;

FIG. 7 shows a representation of an exemplary Table of Remote Site Replay Transactions in accordance with an exemplary embodiment of the present invention;

FIG. 8 shows a representation of an exemplary single persistent value kept for each directory in the partition on the remote site in accordance with an exemplary embodiment of the present invention;

FIG. 9 is a logic flow diagram showing a representation of an exemplary algorithm to determine if the remote site's mirror copy of the namespace is synchronized enough, in accordance with an exemplary embodiment of the present invention;

FIG. 10 is a logic flow diagram showing a representation of an exemplary algorithm for performing synchronization in accordance with an exemplary embodiment of the present invention;

FIGS. 11-16 show representations of the files and directories in a sample partition as well as representations of how the Table of Partition Transactions and the Tables of Directory Transactions are maintained as files and directories are added and deleted from the sample partition, in accordance with an exemplary embodiment of the present invention;

FIG. 17 shows an exemplary switched file system in which WAN Optimization Appliances are interposed between the remote file switch and the central file switch;

FIG. 18 shows a file switched system having two file switches with WAN optimization functionality in accordance with an exemplary embodiment of the present invention; and

FIG. 19 shows an exemplary system including two WAN Optimization Appliances with SMB signing functionality in accordance with an exemplary embodiment of the present invention.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Definitions. As used in this description and related claims, the following terms shall have the meanings indicated, unless the context otherwise requires:

Aggregator. An “aggregator” is a file switch that performs the function of directory, data or namespace aggregation of a client data file over a file array.

Data Stream. A “data stream” is a segment of a stripe-mirror instance of a user file. If a data file has no spillover, the first data stream is the stripe-mirror instance of the data file. But if a data file has spillovers, the stripe-mirror instance consists of multiple data streams, each data stream having metadata containing a pointer pointing to the next data stream. The metadata file for a user file contains an array of pointers pointing to a descriptor of each stripe-mirror instance; and the descriptor of each stripe-mirror instance in turn contains a pointer pointing to the first element of an array of data streams.

File Array. A “file array” consists of a subset of servers of a NAS array that are used to store a particular data file.

File Switch. A “file switch” is a device (or group of devices) that performs file aggregation, transaction aggregation and directory aggregation functions, and is physically or logically positioned between a client and a set of file servers. To client devices, the file switch appears to be a file server having enormous storage capabilities and high throughput. To the file servers, the file switch appears to be a client. The file switch directs the storage of individual user files over multiple file servers, using striping to improve throughput and using mirroring to improve fault tolerance as well as throughput. The aggregation functions of the file switch are done in a manner that is transparent to client devices. The file switch preferably communicates with the clients and with the file servers using standard file protocols, such as CIFS or NFS. The file switch preferably provides full virtualization of the file system such that data can be moved without changing path names and preferably also allows expansion/contraction/replacement without affecting clients or changing pathnames.

Switched File System. A “switched file system” is defined as a network including one or more file switches and one or more file servers. The switched file system is a file system since it exposes files as a method for sharing disk storage. The switched file system is a network file system, since it provides network file system services through a network file protocol: the file switches act as network file servers and the group of file switches may appear to the client computers as a single file server.

Data File. In accordance with exemplary embodiments of the present invention, a file has two distinct sections, namely a “metadata file” and a “data file”. The “data file” is the actual data that is read and written by the clients of a file switch. A file is the main component of a file system. A file is a collection of information that is used by a computer. There are many different types of files that are used for many different purposes, mostly for storing vast amounts of data (e.g., database files, music files, MPEGs, videos). There are also types of files that contain applications and programs used by computer operators as well as specific file formats used by different applications. Files range in size from a few bytes to many gigabytes and may contain any type of data. Formally, a file is called a stream of bytes (or a data stream) residing on a file system. A file is always referred to by its name within a file system.

Metadata File. A “metadata file,” also referred to as the “metafile,” is a file that contains metadata, or at least a portion of the metadata, for a specific file. The properties and state information (e.g., defining the layout and/or other ancillary information of the user file) about a specific file are called metadata. In embodiments of the present invention, although ordinary clients are typically not permitted to directly read or write the content of the metadata files by issuing read or write operations, the clients still have indirect access to ordinary directory information and other metadata, such as file layout information, file length, etc. In fact, in embodiments of the invention, the existence of the metadata files is transparent to the clients, who need not have any knowledge of the metadata files.

Mirror. A “mirror” is a copy of a file. When a file is configured to have two mirrors, that means there are two copies of the file.

Network Attached Storage Array. A “Network Attached Storage (NAS) array” is a group of storage servers that are connected to each other via a computer network. A file server or storage server is a network server that provides file storage services to client computers. The services provided by the file servers typically include a full set of services (such as file creation, file deletion, file access control (lock management services), etc.) provided using a predefined industry standard network file protocol, such as NFS, CIFS or the like.

Oplock. An oplock, also called an “opportunistic lock,” is a mechanism for allowing the data in a file to be cached, typically by the user (or client) of the file. Unlike a regular lock on a file, an oplock on behalf of a first client is automatically broken whenever a second client attempts to access the file in a manner inconsistent with the oplock obtained by the first client. Thus, an oplock does not actually provide exclusive access to a file; rather it provides a mechanism for detecting when access to a file changes from exclusive to shared, and for writing cached data back to the file (if necessary) before enabling shared access to the file.
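By way of illustration only, the oplock behavior defined above may be modeled as a small state machine. The following Python sketch is a simplified assumption and not the CIFS protocol itself; the class name OplockFile and its methods are hypothetical. It shows the first client caching data under an oplock, and the oplock being broken (with cached data written back) when a second client opens the file.

```python
class OplockFile:
    """Toy model of a file whose data one client may cache under an oplock.

    All names here are hypothetical; real oplocks are negotiated by the
    network file protocol, which this sketch does not attempt to reproduce.
    """

    def __init__(self, data=b""):
        self.data = data            # contents held by the file server
        self.oplock_holder = None   # client currently holding the oplock
        self.cached = None          # that client's cached (possibly dirty) copy
        self.shared = False         # True once access has become shared

    def open(self, client):
        if self.shared:
            return "shared access"
        if self.oplock_holder is None:
            # First opener: grant the oplock so the client may cache the data.
            self.oplock_holder = client
            self.cached = self.data
            return "oplock granted"
        if self.oplock_holder == client:
            return "oplock already held"
        # A second client arrives: break the oplock, write the cached data
        # back to the file, and only then enable shared access.
        self.data = self.cached
        self.oplock_holder = None
        self.cached = None
        self.shared = True
        return "oplock broken, shared access"

    def cached_write(self, client, data):
        # Only the oplock holder may modify its local cache.
        if client != self.oplock_holder:
            raise PermissionError("client does not hold the oplock")
        self.cached = data
```

In this sketch the write-back happens inside open() for brevity; in practice the break would be a message exchange with the first client.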

Spillover. A “spillover” file is a data file (also called a data stream file) that is created when the data file being used to store a stripe overflows the available storage on a first file server. In this situation, a spillover file is created on a second file server to store the remainder of the stripe. In the unlikely case that a spillover file overflows the available storage of the second file server, yet another spillover file is created on a third file server to store the remainder of the stripe. Thus, the content of a stripe may be stored in a series of data files, and the second through the last of these data files are called spillover files.
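As a minimal sketch of the spillover definition above, the following Python function places a stripe on a sequence of file servers with limited free space. The function name place_stripe and the representation of servers as a list of free-byte counts are assumptions made for illustration only.

```python
def place_stripe(stripe_size, free_space):
    """Return a list of (server_index, bytes_stored) pairs for one stripe.

    The first pair is the original data file; every later pair represents a
    spillover file holding the remainder of the stripe. `free_space` is a
    hypothetical list of free bytes per file server, in placement order.
    """
    placement = []
    remaining = stripe_size
    for server, free in enumerate(free_space):
        if remaining == 0:
            break
        stored = min(remaining, free)
        if stored > 0:
            placement.append((server, stored))
            remaining -= stored
    if remaining:
        raise OSError("not enough storage for the stripe")
    return placement


# A 100-byte stripe overflows the first and second servers.
print(place_stripe(100, [40, 50, 30]))  # [(0, 40), (1, 50), (2, 10)]
```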

Strip. A “strip” is a portion or a fragment of the data in a user file, and typically has a specified maximum size, such as 32 Kbytes, or even 32 Mbytes. Each strip is contained within a stripe, which is a data file containing one or more strips of the user file. When the amount of data to be stored in a strip exceeds the strip's maximum size, an additional strip is created. The new strip is typically stored in a different stripe than the preceding strip, unless the user file is configured (by a corresponding aggregation rule) not to be striped.

Stripe. A “stripe” is a portion of a user file. In some cases an entire file will be contained in a single stripe, but if the file being striped becomes larger than the stripe size, an additional stripe is typically created. In the RAID-5 scheme, each stripe may be further divided into N stripe fragments. Among them, N−1 stripe fragments store data of the user file and one stripe fragment stores parity information based on the data. Each stripe may be (or may be stored in) a separate data file, and may be stored separately from the other stripes of a data file. As described elsewhere in this document, if the data file (also called a “data stream file”) for a stripe overflows the available storage on a file server, a “spillover” file may be created to store the remainder of the stripe. Thus, a stripe may be a logical entity, comprising a specific portion of a user file, that is distinct from the data file (also called a data stream file) or data files that are used to store the stripe.
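The division of a stripe into N−1 data fragments and one parity fragment can be sketched as follows. This Python example assumes the conventional RAID-5 parity, namely the bytewise XOR of the data fragments, and zero-pads the stripe to an even fragment length; the function names are hypothetical and the fragment layout used by any particular embodiment is not specified here.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(stripe, n):
    """Split `stripe` into n-1 data fragments plus one XOR parity fragment."""
    data_count = n - 1
    frag_len = -(-len(stripe) // data_count)          # ceiling division
    padded = stripe.ljust(frag_len * data_count, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(data_count)]
    parity = reduce(xor_bytes, frags)
    return frags + [parity]


def rebuild(fragments, missing):
    """Recover the fragment at index `missing` by XORing all the others."""
    survivors = [f for i, f in enumerate(fragments) if i != missing]
    return reduce(xor_bytes, survivors)


fragments = split_with_parity(b"abcdefgh", 5)   # 4 data fragments + 1 parity
assert rebuild(fragments, 1) == b"cd"           # a lost data fragment is recovered
```

Because XOR is its own inverse, any single fragment, data or parity, can be recomputed from the remaining N−1 fragments.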

Stripe-Mirror Instance. A “stripe-mirror instance” is an instance (i.e., a copy) of a data file that contains a portion of a user file on a particular file server. There is one distinct stripe-mirror instance for each stripe-mirror combination of the user file. For example, if a user file has ten stripes and two mirrors, there will be twenty distinct stripe-mirror instances for that file. For files that are not striped, each stripe-mirror instance contains a complete copy of the user file.
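The count in the example above follows from enumerating every stripe-mirror combination. A brief Python sketch, in which the function name and the use of integer indices for stripes and mirrors are illustrative assumptions:

```python
from itertools import product


def stripe_mirror_instances(stripes, mirrors):
    """Enumerate (stripe_index, mirror_index) pairs, one per instance."""
    return list(product(range(stripes), range(mirrors)))


instances = stripe_mirror_instances(10, 2)
print(len(instances))  # 20
```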

Subset. A subset is a portion of a thing, and may include all of the thing. Thus a subset of a file may include a portion of the file that is less than the entire file, or it may include the entire file.

User File. A “user file” is the file or file object that a client computer works with (e.g., reads, writes, etc.), and in some contexts may also be referred to as an “aggregated file.” A user file may be divided into portions and stored in multiple file servers or data files within a switched file system.

File Virtualization in a Switched File System

FIG. 1 is a schematic block diagram of a switched file system in accordance with various embodiments of the invention described in the related application incorporated by reference above. Specifically, a file switch (which may also be referred to as a file virtualization appliance or MFM) is in communication with a number of clients over a communication network and is in communication with a number of file servers over the same or a different communication network. The file switch may also be in communication with one or more directly connected file servers. Thus, the file switch sits in the data path (either physically or logically) between the clients and the file servers for certain transactions. In specific embodiments, the file switch may be embodied as a product from Attune Systems, Inc. referred to as Maestro File Manager (MFM). The MFM may be provided in at least two different versions, specifically a standard version referred to as the FM5500 and a high-availability version referred to as the FM5500-HA.

The file switch may support a wide range of features and functionality such as, for example, providing a unified global namespace, providing storage virtualization, and managing storage of files in the file servers. File virtualization decouples file names from the physical file storage locations and hides the physical storage attributes of the files from the clients so that the users or applications are completely unaware which file server (or file servers) actually handles the file access. The file switch may store a file in a single file server or across multiple file servers, and may store files so as to emulate mirroring, striping, or other redundancy schemes. A native mode may be supported in which clients may communicate directly with the file servers in order to access certain files. The file switch may manage file storage based on a set of rules and may support reapply and relayout functions. The file switch may store certain small files along with metadata. The file switch may support other features described in the related applications.

As a result of separating the full name of a file from the file's physical storage location, file virtualization provides the following capabilities:

1) Creation of a synthetic namespace

Once a file is virtualized, the full filename does not provide any information about where the file is actually stored. This leads to the creation of synthetic directories where the files in a single synthetic directory may be stored on different file servers. A synthetic namespace can also be created where the directories in the synthetic namespace may contain files or directories from a number of different file servers. Thus, file virtualization allows the creation of a single global namespace from a number of cooperating file servers. The synthetic namespace is not restricted to be from one file server, or one file system.

2) Allows having many full filenames to refer to a single file

As a consequence of separating a file's name from the file's storage location, file virtualization also allows multiple full filenames to refer to a single file. This is important as it allows existing users to use the old filename while allowing new users to use a new name to access the same file.

3) Allows having one full name to refer to many files

Another consequence of separating a file's name from the file's storage location is that one filename may refer to many files. Files that are identified by a single filename need not contain identical contents. If the files do contain identical contents, then one file is usually designated as the authoritative copy, while the other copies are called the mirror copies. Mirror copies increase the availability of the authoritative copy, since even if the file server containing the authoritative copy of a file is down, one of the mirror copies may be designated as a new authoritative copy and normal file access can then be resumed. On the other hand, the contents of a file identified by a single name may change according to the identity of the user who wants to access the file.
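The first two capabilities listed above can be illustrated with a minimal sketch in which full filenames and storage locations are kept in separate tables. The class SyntheticNamespace, its methods, and the server names and paths below are hypothetical and are not taken from any embodiment; the sketch only shows that a file can be moved between file servers, or given a second name, without changing the name that clients use.

```python
class SyntheticNamespace:
    """Toy global namespace: full filenames are decoupled from locations."""

    def __init__(self):
        self._names = {}      # full filename -> internal file id
        self._locations = {}  # internal file id -> (file server, server path)
        self._next_id = 0

    def create(self, name, server, path):
        file_id = self._next_id
        self._next_id += 1
        self._names[name] = file_id
        self._locations[file_id] = (server, path)

    def alias(self, new_name, existing_name):
        # Many full filenames may refer to a single file.
        self._names[new_name] = self._names[existing_name]

    def move(self, name, server, path):
        # Relocating the file changes only the location table, not the name.
        self._locations[self._names[name]] = (server, path)

    def resolve(self, name):
        return self._locations[self._names[name]]


ns = SyntheticNamespace()
ns.create("/eng/spec.doc", "nas1", "/vol0/a17")
ns.alias("/archive/spec.doc", "/eng/spec.doc")
ns.move("/eng/spec.doc", "nas2", "/vol3/b02")
print(ns.resolve("/archive/spec.doc"))  # ('nas2', '/vol3/b02')
```

The third capability (one name referring to several copies) would extend the location table to hold a list of locations per file id, one of which is designated authoritative.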

In exemplary embodiments of the invention, the file switch appears to the clients as a standard file server and appears to the file servers as a standard client. In such embodiments, communication between the clients and the file switch can utilize standard network file protocols (e.g., NFS and/or CIFS) without requiring any additional software running in the clients, and communication between the file switch and the file servers can utilize standard network file protocols (e.g., NFS and/or CIFS) without requiring any additional software running on the servers. In fact, the file switch could utilize one network file protocol when communicating with the clients and a different network file protocol when communicating with the file servers in certain embodiments. Additionally, or alternatively, the file switch may communicate with different types of clients using different protocols (e.g., some clients may use NFS while other clients may use CIFS), and, similarly, the file switch may communicate with different types of file servers using different protocols (e.g., some file servers may use NFS while other file servers may use CIFS). In one exemplary embodiment, the file switch may communicate with both NFS and CIFS clients but store files in the file servers using only CIFS. Since the file switch essentially operates as both a network file client and a network file server, the file switch may support a full range of client/server features such as, for example, SMB signing for authenticating communications with the clients and/or with the file servers.

Remote File Virtualization

A typical business environment may have branch offices located at many remote sites across a wide geographical area. However, the data center(s) that host the file servers are usually centralized in one or two sites. This allows for economies of scale and ease of management, as well as providing physical security.

Users at branch offices often need to access data stored at the central site. Unfortunately, the transmission speed of the wide area network connecting the users at the branch offices to the file servers located at the central site is usually much lower than the speed of the local area network (LAN). This is partly due to the cost of network connection links as well as the latency introduced by the physical distance separating a branch office from the central site. To overcome the limited transmission speed and to reduce the latency of the WAN, one scheme is to deploy a “latency reduction” or “WAN access optimization” appliance at both the central site and at the remote site. However, a better strategy is to reduce or eliminate the need to send network packets across the WAN, for example, by satisfying as many of the file requests locally (i.e., at the remote site) as possible instead of having to send the requests across the WAN.

Furthermore, if certain files are typically authored or modified locally, it would be efficient to operate on local copies of the files. Normally, this would be solved by keeping the file locally at the remote office on an edge server (i.e., managed file servers at the remote sites). However, since the branch offices are not true data centers, there may be issues with managing these servers or NAS devices at the remote site, including backups, restores, and ongoing maintenance. Therefore, certain embodiments remove the need for managed edge servers while still providing the ability to write files locally at the remote site.

Thus, it is desirable for file virtualization to work both at the central site as well as the remote sites. It is also desirable that the central site and the remote sites share the same common namespace. Embodiments of the present invention described below extend file virtualization across the WAN in order to accelerate remote file access. For convenience, such extended file virtualization is referred to hereinafter as Remote File Virtualization.

FIG. 2 is a schematic block diagram of a switched file system employing remote file virtualization in accordance with an exemplary embodiment of the present invention. Here, the switched file system includes two file switches, namely a central file switch situated near the file servers and a remote file switch situated near the clients. The central file switch and the remote file switch are in communication over a WAN such as the Internet. In this example, the remote file switch appears to the clients as a file server and appears to the central file switch as a client, while the central file switch appears to the remote file switch as a file server and appears to the file servers as a client. It should be noted that multiple remote file switches may operate with a single central file switch over the WAN.

In order to help reduce or eliminate certain communications over the WAN, copies of file data and/or metadata may be stored at the central site and at one or more of the remote site(s). One copy is typically considered to be the “authoritative” copy while the other copies are considered to be “mirror” copies. The authoritative copy may be at the central site or at one of the remote sites. Examples of both situations are described below. A mirror server is storage that may contain the current, past, or both current and past mirror copies of the authoritative copy of a file. No particular directory structure is assumed. A file virtualization appliance, such as the MFM described above, is responsible for keeping the contents of the mirror copies in sync with the authoritative copy. If the contents of a mirror copy are not identical with the authoritative copy of the file, the mirror is broken and the mirror copy is generally discarded.

The delay and the relatively lower reliability of the WAN make it impractical to keep the contents of the mirror copies stored at one site identical with the authoritative copy stored at another site. Instead of having one site notify all of the other sites to break the mirror if the authoritative copy has changed, in exemplary embodiments, each site is generally responsible for checking if its own mirror copy is identical with the authoritative copy. If the mirror copy is identical, then the file accesses generally can be satisfied locally, resulting in faster file access performance. If the contents are not identical, the mirror copy is generally not used, in which case file access requests are sent over the WAN to another site for processing. For example, file access requests may be forwarded from the central site to a remote site if the authoritative copy is not present on the central site.

In order to maintain a common namespace between the central site and remote sites, certain synchronization techniques are used to keep the namespace contents (information about a subset of files within the file system) consistent and in sync between the remote sites and the central site. A number of exemplary synchronization techniques are described below. Under the common namespace across the central site and all remote sites, applications or users at the remote site will not be aware of the actual location where the file requests are being serviced. By accessing a locally stored copy instead of the copy stored at the central site, users will perceive the situation as if the authoritative copy is stored locally even if the authoritative copy is actually stored at the central site or at another remote site. If the remote file switch is able to service a particular file request from a client at the remote location, then no communication over the WAN should be needed for that file request. As a result, there should be a substantial speed increase for the remote file access since local file access is typically faster than an access to the central site.

In certain embodiments of the present invention, remote file virtualization is accomplished using a lazy metadata mirroring technique together with a lazy file data mirroring technique and a reverse file data mirroring technique in order to maintain a common namespace across the central site and all the remote sites. These techniques will be described below.

Mirroring

One of the major functions of file virtualization is to provide data mirroring. Since the filename of a file is now independent of its storage location, the contents of a file may be served by more than one server for increasing availability. If one server is down, a backup server that contains the identical copy of the file, the mirror, could be used instead. Mirroring can be done on a per file basis, on a per directory basis, on a per volume basis, or from the result of a policy that identifies a set of files using a specific criterion.

For example, a Server 1 may be the primary server for servicing file A, and a Server 2 may be used as the backup server that contains a mirror copy of file A. The MFM is responsible for keeping the contents of the mirror copy of file A in the backup Server 2 in sync with the contents of the original file A in the Server 1. The file A in Server 1 is said to be the authoritative copy and is usually updated first and consulted first.

One way that file virtualization can help accelerate file access from a remote site across the WAN to a file server located at the central site is to preposition mirror copies of the file from the central site to the remote site (local to users), with the central site designated to store the authoritative copy of each file, and each remote site maintaining a mirror copy of the authoritative copy at the central site. This allows using the local mirror copies to satisfy as many file accesses as possible. As a result, if a user is accessing a mirror copy locally, the user will perceive that the authoritative copy is stored locally, even though the authoritative copy is actually stored at the central site. In exemplary embodiments, if a file is deleted or modified at the remote site, the central site is notified first, and then all MFMs at remote sites are notified of the file being deleted or modified, so that all MFMs have their metadata information updated.

In order to perform such mirroring, the MFM typically uses an active mirroring technique that involves applying the same file operation on file A to both Server 1 and Server 2. This mirroring technique does not distinguish between data operations (read/write) and metadata operations (lookup, enumeration). All file operations are mirrored actively. Active mirroring generally also assumes that there are only a limited number of mirrors. There is no need, under normal circumstances, to have more than two or three mirrors for a file.

Files may be placed on the mirror server by pre-positioning or on the fly, for example, through the File Transfer Protocol (FTP). In an exemplary embodiment, each mirror copy in the mirror server is identified by a 160-bit number, which is the sha1 digest computed from the contents of the mirror copy. A sha1 digest value is, for practical purposes, a globally unique value for any given set of data (contents) of a file. Therefore, if two files are identical in contents (but not necessarily name or location), they will always have the same sha1 digest value. Conversely, if two files are different in contents, they should have different sha1 digest values, since two distinct contents are extremely unlikely to produce the same digest by chance.
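By way of illustration only, the content-based identification described above can be sketched in Python using the standard hashlib module. The function name sha1_digest and the sample contents are assumptions made for this sketch and are not part of any embodiment.

```python
import hashlib


def sha1_digest(data: bytes) -> str:
    """Return the 160-bit sha1 digest of data as 40 hexadecimal characters."""
    return hashlib.sha1(data).hexdigest()


# Hypothetical file contents: two identical copies and one edited copy.
copy_at_central = b"quarterly report, revision 3"
copy_at_remote = b"quarterly report, revision 3"
edited_copy = b"quarterly report, revision 4"

# Identical contents yield the same identifier regardless of name or location.
same = sha1_digest(copy_at_central) == sha1_digest(copy_at_remote)
# Different contents are expected to yield different identifiers.
different = sha1_digest(copy_at_central) != sha1_digest(edited_copy)
```

Because the identifier depends only on the contents, a mirror copy can be looked up by digest without any knowledge of the file's name or directory.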

Many approaches could be used to manage the storage space of the mirror server. For example, the storage space in the mirror server may be reclaimed periodically by purging mirrors that are least recently used. Alternatively, the mirrors are purged one at a time, and only when needed, i.e., when storage space is needed in the mirror server to store a new mirror. It is important to note that the mirror server is unmanaged storage. The authoritative copy of the data always lives at the central site. If the mirror server is lost, or if mirrors need to be purged from the mirror server, the authoritative copy of the data can always be fetched from the central site.
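As a minimal sketch of the second approach (purging the least recently used mirrors one at a time, and only when space is needed), the following Python class keeps mirrors in an ordered mapping keyed by digest. The class name MirrorStore, the in-memory storage, and the byte-count capacity are assumptions for illustration; an actual mirror server would store the copies on disk.

```python
from collections import OrderedDict


class MirrorStore:
    """Hypothetical in-memory stand-in for a mirror server keyed by sha1 digest."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._mirrors = OrderedDict()  # digest -> contents, least recently used first

    def get(self, digest):
        data = self._mirrors.get(digest)
        if data is not None:
            self._mirrors.move_to_end(digest)  # mark as most recently used
        return data

    def put(self, digest, data):
        if digest in self._mirrors:
            self._mirrors.move_to_end(digest)
            return True
        if len(data) > self.capacity_bytes:
            return False  # can never fit; the central site still holds the data
        # Purge one mirror at a time, and only while space is still needed.
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self._mirrors.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._mirrors[digest] = data
        self.used_bytes += len(data)
        return True


store = MirrorStore(capacity_bytes=10)
store.put("digest-a", b"12345")   # placeholder keys, not real sha1 values
store.put("digest-b", b"12345")
store.get("digest-a")             # "digest-b" is now the least recently used
store.put("digest-c", b"12345")   # forces "digest-b" to be purged
```

Losing a purged mirror is harmless in this model because the mirror server is unmanaged storage and the authoritative copy can be fetched again from the central site.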

Thus, having a mirror server affects only the read access performance and not the correctness of the read operation.

The computation of the sha1 digest is performed at the central site and is usually done periodically by a background process. The sha1 digest computation process walks through the directory hierarchy associated with a partitioned namespace, starting from the root of the directory hierarchy and inspecting every directory and sub-directory until all directories within the partitioned namespace are inspected. For each file that is idle (not opened) and without a sha1 digest, the process computes the sha1 digest and stores the sha1 digest as an extended attribute or as an alternate data stream within the metadata of a file. Newly created files do not have sha1 digests immediately after the file is created. In addition, the sha1 digest of a file, if it exists, is cleared immediately before the first update (write or setsize, for example) is applied to the file.
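The background process described above might be sketched as follows. This sketch records digests in a Python dictionary rather than in an extended attribute or alternate data stream, and it receives the set of currently open files as a parameter; both choices, along with the function names, are assumptions made to keep the example portable and self-contained.

```python
import hashlib
import os
import tempfile


def compute_missing_digests(root, digests, open_files=frozenset()):
    """Walk root top-down; add a sha1 digest for each idle file lacking one."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path in open_files or path in digests:
                continue  # file is in use, or already has a digest
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            digests[path] = h.hexdigest()
    return digests


def clear_digest_before_update(path, digests):
    """Drop the digest just before the first write or setsize on the file."""
    digests.pop(path, None)


# Usage on a throwaway directory tree with one idle file and one open file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    idle = os.path.join(root, "sub", "idle.txt")
    busy = os.path.join(root, "busy.txt")
    for p in (idle, busy):
        with open(p, "wb") as f:
            f.write(b"abc")
    digests = compute_missing_digests(root, {}, open_files={busy})
    has_idle = idle in digests
    has_busy = busy in digests
    idle_digest = digests[idle]
    clear_digest_before_update(idle, digests)
    cleared = idle not in digests
```

A file whose digest has been cleared is simply picked up again on a later pass of the background process once it is idle.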

When a common namespace is reconstructed at a remote site, the metadata and the sha1 digest, if any, will also be duplicated at the MFM located at the remote site. The remote file virtualization appliance (MFM) will guarantee that as long as a parent directory is synchronized or is up-to-date with the authoritative copy at the central site, the metadata of all files and directories contained in the parent directory will also be up-to-date.

When a client at a remote site opens a file stored at the central site, the open request is actually sent to the MFM located at the remote site. The process to open a file is as follows:

The parent directory of the file to be opened is checked to see if it is synchronized with the authoritative copy stored at the central site, as described further herein. If the parent directory is not synchronized, the open request is forwarded to the central site. If the open is successful, the authoritative file handle, hereafter referred to as auth file handle, is returned to the user. If not, an error code is returned to the user.

If the parent directory is synchronized with its authoritative copy at the central site, and if the file is opened for create, delete, or update, the open request is forwarded to the central site. If the open is successful, the auth file handle is returned to the user. If not, an error code is returned to the user.

Otherwise, an attempt is made to open the file locally first. If the open is not successful, an error code is returned. The file handle from opening the file locally is called the local file handle. Notice that the local file is actually a sparse file and does not contain any data (as discussed in the co-patent application). The local file's associated metadata may or may not be synchronized with the authoritative copy at the central site.

If the open of the local file is successful, then the open request is again forwarded to the central site. If the open at the central site is not successful, the local file is closed and an error code from the central site is returned to the user. This is because the central site has the authoritative copy of the file.

If the open of the file at the central site is successful, the local file handle is associated with the auth file handle. The auth file handle is returned to the user.
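The open steps above can be sketched as a single decision function. In this sketch the central-site open, the local open, and the local close are passed in as callables; these callables, the OpenResult type, and the string mode names are hypothetical and serve only to make the control flow explicit.

```python
from dataclasses import dataclass
from typing import Optional

MODIFYING_MODES = {"create", "delete", "update"}


@dataclass
class OpenResult:
    auth_handle: Optional[str]
    local_handle: Optional[str]
    error: Optional[str] = None


def open_file(path, mode, parent_synced, open_central, open_local, close_local):
    """open_central and open_local are assumed to return (handle, error)."""
    # Parent directory not synchronized, or the open may modify the file:
    # forward the open to the central site and use only the auth file handle.
    if not parent_synced or mode in MODIFYING_MODES:
        auth, err = open_central(path, mode)
        return OpenResult(auth, None, err)
    # Otherwise try the local (sparse) file first.
    local, err = open_local(path, mode)
    if local is None:
        return OpenResult(None, None, err)
    # The central site holds the authoritative copy, so it must also agree.
    auth, err = open_central(path, mode)
    if auth is None:
        close_local(local)
        return OpenResult(None, None, err)
    # Associate the local file handle with the auth file handle.
    return OpenResult(auth, local)
```

Only the auth file handle would be returned to the user; the local file handle in the result represents the association kept inside the remote MFM.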

When a file request is sent to the MFM, it must include a file handle (the auth file handle). The steps for handling a file identified by the input file handle are as follows:

If the request is a lock request, the lock request is forwarded to the central site. If the lock is not granted, the error code is returned to the user. If the lock is granted and there is no local file handle, a success code is returned to the user. Otherwise, the sha1 digest is obtained from the central site and from the local MFM. If they match, an open mirror file request with the file's sha1 digest as input is sent to the mirror server. If the mirror exists, a mirror file handle is returned. Otherwise, the mirror handle is set to null.

If the request is a forced lock-release, the process sends a forced lock request to the user so that the user can flush their data back to the local MFM and the local MFM again sends the modified data back to the central site.

If the request is a read operation and if a mirror file handle exists, the request is forwarded to the mirror server. Otherwise, the request is forwarded to the central site. The result from either the mirror server or from the central site is returned to the user.

If the request is a get file attributes operation and if the local file handle exists, the request is processed locally, using the local file handle. Otherwise, the request is forwarded to the central site using the auth file handle. The result from either the local site or from the central site is returned to the user.

Otherwise, all operations are sent to the central site using the auth file handle. The result is then sent back to the user.

Notice that all locking, write, or update attributes operations are sent to the central site. These operations will always incur the WAN latency overhead as well as the WAN transmission speed limitation.
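The request handling steps above can be summarized in a short routing sketch. The operation names, the function names, and the "mirror:" handle format are hypothetical. Treating a digest mismatch as "no mirror handle" is also an assumption of this sketch, since the steps above state only what happens when the digests match.

```python
def handle_lock(lock_granted, local_handle, central_digest, local_digest,
                mirror_digests):
    """Return (status, mirror_handle) after a lock request reaches the central site."""
    if not lock_granted:
        return ("error", None)
    if local_handle is None:
        return ("success", None)
    # Open the mirror only if the local digest matches the central site's
    # digest and the mirror server actually holds a copy with that digest.
    if (central_digest is not None and central_digest == local_digest
            and central_digest in mirror_digests):
        return ("success", "mirror:" + central_digest)
    return ("success", None)  # mirror handle is null


def route_request(op, has_local_handle, has_mirror_handle):
    """Return where a request is serviced: 'mirror', 'local', or 'central'."""
    if op == "read" and has_mirror_handle:
        return "mirror"
    if op == "getattr" and has_local_handle:
        return "local"
    # Locks, writes, attribute updates, and everything else go over the WAN.
    return "central"
```

Under this routing, only reads backed by a mirror file handle and attribute queries backed by a local file handle avoid the WAN.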

The central site can always request any set of mirror copies stored in the mirror server at the remote sites to be purged. This is done by sending a list of sha1 digest values to a remote site. The remote site MFM will then purge all of the mirror copies from the mirror server whose sha1 digest matches the sha1 digest values in the purge list.
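A minimal sketch of how a remote site MFM might apply such a purge list is shown below, assuming the mirror server is modeled as a dictionary from digest to contents. The function name and the placeholder digest strings are illustrative only.

```python
def purge_mirrors(mirror_store, purge_list):
    """Remove each mirror whose digest appears in purge_list; return the count."""
    removed = 0
    for digest in set(purge_list):      # ignore duplicate entries in the list
        if digest in mirror_store:
            del mirror_store[digest]
            removed += 1
    return removed


# Placeholder digests; real keys would be 160-bit sha1 values.
mirror_store = {"digest-a": b"alpha", "digest-b": b"beta", "digest-c": b"gamma"}
removed_count = purge_mirrors(mirror_store, ["digest-a", "digest-c", "digest-x"])
```

Digests in the purge list that are not present at the remote site are simply skipped, so the same list can be sent to every remote site.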

A variety of ways can be used to preposition the mirror copies on the mirror servers at the remote site. Since each mirror copy is uniquely identified by its sha1 digest, prepositioning of mirror copies can be done at any time and independently without regard to the actual state of the files at the central site. For example, the mirror copies can be stored on a removable storage device such as a USB disk or on a DVD and sent via express delivery nightly from the central site to the remote sites. At the remote site, the mirror copies can be loaded on the mirror server. Another method of prepositioning is to use satellites to broadcast the mirror copies to the remote sites. Of course, if the transmission speed of the network connection between the remote sites and the central site is fast enough, unicast or multicast networking protocols can be used to preposition mirror copies from the central site to the remote site via the WAN.

Lazy Metadata Mirroring

Active mirroring is not practical in a WAN environment because the low network bandwidth and high network latency of the WAN make it difficult to synchronize the contents of a mirrored file at one or more remote sites with the authoritative copy at the central site in a timely manner, particularly when there are many remote sites whose mirrors will need to be updated in order to be in sync with the authoritative copy in the central site. Also, active mirroring in such situations may place a heavy load on the central site's MFM. As a result, clients at the remote site may end up accessing stale data under some circumstances.

In exemplary embodiments of the present invention, mirroring is divided into two processes, namely metadata mirroring and data mirroring. Instead of treating all operations (reads and writes) from the clients in the same manner, metadata requests and data requests are treated differently. Some of these differences are identified below.

By mirroring metadata to the remote MFM, the MFM at the remote site is able to respond directly to the metadata operations (terminate the metadata operations) and thus eliminate most metadata traffic between the remote sites and the central site under normal situations.

The metadata mirroring does not have to be completely in place between the remote site and the central site immediately in order to use the system. For example, the remote site initially could have its “root” set to point back to the central site. In this case, the remote MFM just forwards the metadata requests across the wire to the central site (with no particular savings due to the MFM at this point in time). As bandwidth is available, the central MFM could “push” subdirectory levels of information to the remote MFM. After each subdirectory is pushed, the central MFM should re-verify that the subdirectory has not changed since being pushed, and then notify the remote MFM that the remote MFM now has a valid mirror of the metadata. From this point in time, the remote MFM can terminate the metadata operations for that subdirectory, until the remote MFM is told that its mirror of the metadata is no longer valid (the remote metadata will generally be valid since the mirroring of metadata is synchronous in nature). All other subdirectories that have not been mirrored continue to point back to the central site. Only subdirectories that have valid mirrors are terminated at the MFM at the remote site. In other words, performance advantages may be noticed immediately when a directory's metadata is mirrored, since those metadata requests can now be terminated at the remote site, before the entire set of metadata is mirrored.

This process of mirroring the metadata can continue pushing metadata as WAN bandwidth is available, until all of the metadata for shared files is pushed to the remote site. At that point, the remote MFM would have a complete mirror of the appropriate metadata, and maintenance of the metadata will be performed as a part of the synchronous metadata mirroring.

An alternative embodiment of the process of metadata mirroring uses a “pull” model, where the remote MFM requests metadata and the central MFM responds with the metadata itself. When all of the requested metadata has been sent, the remote MFM sends a message to the central site MFM asking whether the mirrored metadata sent to the remote MFM is currently valid (the metadata may have become invalid during the period of time when the metadata was being shipped from the central site to the remote site). If the metadata that was sent by the central site MFM is, in fact, valid, the central MFM responds back to the remote MFM with a “yes”. If the metadata that was sent was not valid at that instant, then the central MFM responds back to the remote MFM with a “no”. If the remote MFM receives a “yes”, then it is able to consider its metadata mirror to be valid, and can terminate metadata requests. If the remote MFM receives a “no”, then the remote MFM can just drop the metadata that it received and ask the central MFM to again start sending metadata at an appropriate time (e.g., when network bandwidth is again available).
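The pull model, including the final validity check, can be sketched as follows. The per-directory version counter used by the central site to answer “yes” or “no” is an assumption of this sketch, as are the class names and the during_transfer hook used to simulate a concurrent update.

```python
class CentralSite:
    """Hypothetical central MFM holding authoritative directory metadata."""

    def __init__(self, tree):
        self.tree = tree                          # directory -> {name: metadata}
        self.version = {d: 0 for d in tree}       # assumed change counter

    def update(self, directory, name, metadata):
        self.tree[directory][name] = metadata
        self.version[directory] += 1

    def fetch(self, directory):
        return dict(self.tree[directory]), self.version[directory]

    def is_still_valid(self, directory, version):
        return self.version[directory] == version  # the "yes" / "no" answer


class RemoteSite:
    """Hypothetical remote MFM that pulls metadata and validates it afterwards."""

    def __init__(self, central):
        self.central = central
        self.valid = {}                           # directory -> mirrored metadata

    def pull(self, directory, during_transfer=None):
        metadata, version = self.central.fetch(directory)
        if during_transfer is not None:
            during_transfer()                     # simulate a change while in flight
        if self.central.is_still_valid(directory, version):
            self.valid[directory] = metadata      # "yes": terminate requests locally
            return True
        return False                              # "no": drop it and retry later
```

A remote site that receives “no” loses nothing but the transfer time, since metadata requests for that directory continue to be referred to the central site until a later pull succeeds.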

The pull model embodiment may be preferable in certain embodiments, since central site resources may be limited. One advantage is that the mirroring of metadata generally occurs only when WAN bandwidth is available, and yet the remote clients can still perform metadata operations before the mirrored metadata is completed because the metadata operations can be referred back to the central site until the mirrored metadata is able to satisfy the request.

The process of mirroring the metadata can be done in a breadth first or depth first fashion. In some situations, particularly in a Windows environment, it may be better to perform metadata mirroring in a breadth first fashion because of the way Windows operates. For example, when accessing the file dir1\dir2\dir3\dir4\file.txt, each of the directories dir1, dir2, etc. is opened sequentially, until finally the file.txt file is opened. If a breadth first mirroring is performed, the accesses early in the full path name are more likely to be terminated at the remote MFM.
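To illustrate the difference in mirroring order, the following sketch walks a small, hypothetical directory tree in either breadth first or depth first order. The tree contents and the function name are assumptions for illustration.

```python
from collections import deque


def mirror_order(tree, root, breadth_first=True):
    """Return the order in which directories would have their metadata mirrored."""
    pending = deque([root])
    order = []
    while pending:
        # A queue gives breadth first order; a stack gives depth first order.
        directory = pending.popleft() if breadth_first else pending.pop()
        order.append(directory)
        children = tree.get(directory, [])
        pending.extend(children if breadth_first else reversed(children))
    return order


# Hypothetical namespace: root -> dir1 -> dir2 -> dir3, and root -> dirA -> dirB.
tree = {"root": ["dir1", "dirA"], "dir1": ["dir2"], "dir2": ["dir3"], "dirA": ["dirB"]}
bfs = mirror_order(tree, "root", breadth_first=True)
dfs = mirror_order(tree, "root", breadth_first=False)
```

In the breadth first order, both top-level directories are mirrored before any deeper level, so the first components of any path can be terminated at the remote MFM early in the mirroring process.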

As the subdirectories' metadata is mirrored, sparse files can be used, such that the metadata for each file is copied (size, last access time, last modified time, creation time, owner, permissions, etc.), but the data is not copied (and thus the file is truly sparse, containing absolutely no data).
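As a rough sketch only, a placeholder file carrying the mirrored size and timestamps but no data could be created as shown below. Whether the file is actually stored sparsely depends on the underlying file system, and the creation time, owner, and permissions are omitted here for portability. The function name and the sample values are assumptions.

```python
import os
import tempfile


def create_sparse_placeholder(path, size, atime, mtime):
    """Create a file of the given logical size without writing any file data."""
    with open(path, "wb") as f:
        f.truncate(size)            # sets the length; no data blocks are written
    os.utime(path, (atime, mtime))  # copy last access and last modified times


with tempfile.TemporaryDirectory() as d:
    placeholder = os.path.join(d, "file.txt")
    create_sparse_placeholder(placeholder, 1_000_000, 1_200_000_000, 1_200_000_000)
    st = os.stat(placeholder)
    placeholder_size = st.st_size
    placeholder_mtime = int(st.st_mtime)
```

A get file attributes request against such a placeholder reports the mirrored size and times, while any read of the contents must still be satisfied from a mirror copy or from the central site.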

Additionally, or alternatively, prior to the actual metadata mirroring, the remote site MFM or the central site MFM may keep track of remote site access patterns by remote clients and use those statistics to determine whether breadth first, depth first, or some combination of the two processes is most appropriate for a particular metadata mirroring operation. If the statistics are gathered by the central site MFM, then they could contain either remote site specific access information or global remote site access information (information for all remote sites). This global remote site access information may be particularly useful when setting up a new remote site, since there may not be any access information for the remote site yet which is statistically relevant.

In the situation where some metadata is mirrored at a remote site and the metadata is being updated, there is the potential for accessing stale metadata. Therefore, in an exemplary embodiment of the invention, when metadata is updated at a remote site, the updated metadata is immediately communicated to the central site, and the central site then notifies the remote MFMs (metadata is not sent, just a notification is sent) that the remote site metadata is out of sync. The remote MFMs then consider their own mirror for that particular metadata to be broken, in which case the remote MFMs know that the authoritative copy is back at the central site so any access to the broken mirrored metadata would need to be satisfied via a call to the central site to fetch the metadata, at least until the mirror is reestablished sometime later (performed lazily).

Lazy Data Mirroring and Reverse Data Mirroring

As discussed above, exemplary embodiments of the MFM generally will not support synchronous data mirroring from the central site to the remote site, because synchronous data mirroring to the remote site can create too much of a burden and too much network traffic while performing the data synchronization. Instead, exemplary embodiments of the invention use so-called “lazy data mirroring” at the file level from the central site to the remote sites. Selected files from the central site may be mirrored at the remote site. While these remote mirrors may exist, the authoritative copy is always at the central site.

In the situation where a file's data is mirrored at a remote site, and the file is being updated, there is the potential for accessing stale data. Therefore, in an exemplary embodiment of the invention, when a file is updated at a remote site, the updated data is immediately communicated to the central site, and the remote MFMs are notified (data is not sent, just a notification is sent) by the central site that the remote site data is out of sync. The remote MFMs then consider their own mirror for that particular file to be broken, in which case the remote MFMs know that the authoritative copy is back at the central site so any access to the broken mirrored file would need to be satisfied via a call to the central site to fetch the data, at least until the mirror is reestablished sometime later (performed lazily).
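The break-then-lazily-reestablish behavior described above can be sketched with two small classes. The class and method names are hypothetical, and the WAN is modeled as direct method calls between in-memory objects.

```python
class RemoteMFM:
    """Hypothetical remote MFM holding only mirrors it currently considers valid."""

    def __init__(self):
        self.mirror = {}                      # path -> mirrored data

    def invalidate(self, path):
        self.mirror.pop(path, None)           # break the mirror on notification

    def reestablish(self, path, central):
        self.mirror[path] = central.data[path]  # done lazily, as bandwidth allows

    def read(self, path, central):
        if path in self.mirror:
            return self.mirror[path], "local"
        return central.data[path], "central"  # broken mirror: go to central site


class CentralMFM:
    """Hypothetical central MFM holding the authoritative copies."""

    def __init__(self, remotes):
        self.remotes = remotes
        self.data = {}

    def write(self, path, data):
        self.data[path] = data                # update the authoritative copy first
        for remote in self.remotes:
            remote.invalidate(path)           # notification only; no data is sent
```

Note that only the invalidation is synchronous in this sketch; the call to reestablish can be deferred indefinitely without affecting correctness, because reads fall back to the central site in the meantime.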

Remote clients accessing an in-sync mirrored file on the remote MFM will be “terminated” at the remote MFM, and the normally required network traffic will be averted.

This mirroring of data can be performed in any of a variety of ways. For example, data can be mirrored when it is first accessed (e.g., mirror data as it is being accessed, so subsequent accesses will terminate at the mirrored data on the remote MFM), data can be mirrored using pre-fetching (e.g., fetching the data based on information such as most recently or most frequently accessed data), or data can be mirrored using pre-loading (e.g., pre-load the remote MFM with all data objects of the entire namespace before the MFM is shipped to a remote site with a slow network link).

In embodiments that mirror data using a prefetching process, prior to lazy data mirroring, the central site MFM or remote site MFM may keep track of remote client access patterns (statistics) and use those statistics to determine the order in which files should be lazy mirrored. If the statistics are gathered by the central site MFM, then they could contain either remote site specific access information, or global remote site access information (information for all remote sites). This global remote site access information may be particularly useful when setting up a new remote site, since there may not be any access information for the remote site yet which is statistically relevant.

The term lazy data mirroring is used because the mirroring itself does not happen synchronously. The mirroring operation generally only occurs when sufficient bandwidth is available. Note that the breaking of a mirror is done synchronously (i.e., immediately). Also note that, in the exemplary embodiments discussed above, the central site always holds the authoritative copy of the data. Therefore, if a remote site has any issues (e.g., goes down for an extended period of time), the remote site can simply drop its metadata and data and refer back to the authoritative copy back at the central site while it rebuilds its metadata and data mirrors.

Viewing this mirroring process from the point of view of the remote site, one can consider it “reverse data mirroring”. Before a mirror is established, the remote MFM uses the central site copy of the data. Once the mirror is established, the remote site has a “valid” mirror of the file that the remote site will use to terminate data requests. The remote site's mirror will be valid until the remote MFM is notified that the remote mirror is no longer in sync (and thus no longer valid). At this point, the remote MFM refers back to the central site authoritative copy of the file until the mirror is re-established and made valid.

File Synchronization

The actual process of invalidating a lazy mirrored file can be achieved when the redirector/LAN manager grants the client a Level1 oplock to access and then write a file. In exemplary embodiments of the invention, the remote MFM passes these oplocks through to the central site MFM. When this Level1 oplock is noticed by the central site MFM, the central site MFM sends messages to all other remote site MFMs telling them that their lazy mirrored data for that file is no longer valid. Subsequent requests for data for the broken lazy mirrored data would be sent to the central site to be satisfied. The data mirror can be resynchronized at some opportune later time. (Note: if the metadata for the file is changed, those metadata changes are done synchronously, first going to the central site MFM, then all remote site MFMs are notified that their metadata mirrors are out of sync. The resynchronizing of the remote metadata mirrors can be done lazily, since the remote site MFMs with broken metadata mirrors can simply direct requests to the central site MFM to be satisfied. Eventually, the mirrored metadata can again be rebuilt, at some later opportune time).

FIG. 3 depicts an oplock break sequence in accordance with an exemplary embodiment of the present invention. First, the client wanting to open the file a.txt issues an oplock request (step 1), which is forwarded by Remote Site MFM-1 to the Central Site MFM (step 2). The Central Site MFM issues a request to break an existing oplock (step 3), which is forwarded by the Remote Site MFM-2 to the client having file a.txt open (step 4). That client issues a request to flush and close file a.txt (step 5), which is forwarded by Remote Site MFM-2 to Central Site MFM (step 6). The Central Site MFM then issues an oplock grant (step 7). The Remote Site MFM-1 invalidates its mirrored copy of file a.txt (step 7a) and forwards the oplock grant to the client (step 8), which is then permitted to write the file. The Central Site MFM sends commands to all other Remote Site MFMs to invalidate mirrored data for file a.txt (step 9).

The sequence shown in FIG. 3 is exemplary, and embodiments of the present invention are not limited thereby. It should be noted that some of the steps may be combined or may be performed in a different order. For example, the Central Site MFM may broadcast a notification or command for all Remote Site MFMs to invalidate mirrored data for file a.txt in a single step either before or after forwarding the oplock grant.
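One possible ordering of the messages in such a sequence can be traced with the following sketch, which collapses each request and its forwarded counterpart into a single entry. The message names, the holders mapping, and the site labels are hypothetical, and, as noted above, other orderings are equally valid.

```python
def oplock_sequence(holders, path, requester, all_remotes):
    """Return ordered (message, site) pairs for one oplock request.

    holders maps a path to the remote site whose client currently holds
    the oplock, if any (an assumed bookkeeping structure).
    """
    steps = [("oplock_request", requester)]
    current = holders.get(path)
    if current is not None and current != requester:
        steps.append(("oplock_break", current))       # break the existing oplock
        steps.append(("flush_and_close", current))    # holder flushes and closes
    holders[path] = requester
    steps.append(("oplock_grant", requester))
    steps.append(("invalidate_mirror", requester))    # requester's own mirror
    for site in all_remotes:
        if site != requester:
            steps.append(("invalidate_mirror", site)) # all other remote sites
    return steps


holders = {"a.txt": "MFM-2"}
steps = oplock_sequence(holders, "a.txt", "MFM-1", ["MFM-1", "MFM-2", "MFM-3"])
```

After the sequence completes, no remote site holds a valid mirror of the file, so every subsequent read is referred to the central site until a mirror is lazily reestablished.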

In most systems, most file access is read-only in nature. Also, most file data is unlikely to change. Thus, the lazy data mirroring technique generally is a good tradeoff to reduce “synchronized” mirror data traffic between the central site and a remote site while speeding up normal read access and eliminating much of the WAN traffic. The lazy mirror process generally only performs mirroring operations when surplus WAN bandwidth is available.

One particular advantage of the MFM and the central site file system name space is that not all of the central site's files need to be shared. In exemplary embodiments of the invention, rules can be created such that only the applicable shared files and directories have their metadata mirrored and their data lazily mirrored on the remote site.

In exemplary embodiments, once exported, every remote site gets the same exported (shared) name space such that all remote sites share the same subset of files of the central site file system name space.

It should be noted that there are synchronization issues to be addressed in the face of network (WAN) failures (i.e., failures in the network between the central and remote sites). If the MFM was never installed, remote clients would be unable to access data stored on the central site, even if a WAN Optimization Appliance (discussed below) was installed. However, if the MFM were installed at both the remote and central sites, access to data could be maintained under some circumstances even if the network link goes down. This is because the metadata is mirrored, and the file data could be available locally at the remote site in the lazy mirror. Of course, this could result in the remote clients accessing stale data (e.g., central site could have been updated, but with the network link down, the operation to invalidate the lazy mirror might not be received). This behavior (access to stale data) may be “better” in some instances than losing all access to the data. In other cases, however, one may want to make sure that stale data is never accessed. In this case, the remote MFM could be made aware that the network link was down (e.g., through a heartbeat mechanism or through a mechanism where a ping back to the central office is performed every time the MFM terminates a request). Allowing access to stale data, or disallowing access to data when the network link is down, could be configurable so as to be under administrative control (and this control would be at the file level, as the rule for checking the network availability can be specified on a file by file basis, or some other grouping, based on file names, dates, or other attributes, and able to be specified in the MFM rules).
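A configurable policy of this kind might be sketched as an ordered list of rules matched against the file name. The rule format (a shell-style pattern paired with an allow-stale flag), the first-match-wins behavior, and the default of disallowing stale access are all assumptions of this sketch; the sketch also assumes that a mirror known to the remote MFM is valid whenever the link is up.

```python
import fnmatch


def may_serve_from_mirror(path, link_up, rules, default_allow_stale=False):
    """Decide whether a request may terminate at the local mirror.

    rules is an ordered list of (pattern, allow_stale) pairs; the first
    pattern that matches the path decides the outcome when the link is down.
    """
    if link_up:
        return True  # invalidations can be received, so the mirror is trusted
    for pattern, allow_stale in rules:
        if fnmatch.fnmatchcase(path, pattern):
            return allow_stale
    return default_allow_stale


# Hypothetical administrative rules: never serve stale finance data,
# but allow possibly stale PDF documents while the link is down.
rules = [("finance/*", False), ("*.pdf", True)]
manual_ok = may_serve_from_mirror("docs/manual.pdf", link_up=False, rules=rules)
ledger_ok = may_serve_from_mirror("finance/q1.pdf", link_up=False, rules=rules)
```

The link_up input would come from whatever detection mechanism is configured, such as the heartbeat or per-request ping described above.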

In exemplary embodiments, if a file is updated at a remote site, but the link to the central site is unavailable, the data update would be disallowed, because the authoritative copy of the data lives at the central site. This is no different for applications than in current network/WAN configurations where the application needs to deal with the central site being inaccessible (e.g., without the MFMs being present). Applications are required to deal appropriately with the write being disallowed due to the network being down (e.g., the application can drop the change or can store and save away the change for later transmittal to the central site).

If a remote site comes back up after being down, it could be updated (made to be in sync) by dropping its metadata mirrors and lazy mirrors of data, and then recreating the metadata and data as bandwidth permits. Alternatively, it could be brought up to date (made to be in sync) via a dirty list mechanism (e.g., operations replayed to the remote MFM from the central site). The MFM could just pass through ALL requests (metadata and data) until the entire dirty list is replayed and the MFM is back in sync.
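The dirty list alternative can be sketched as follows. The format of a dirty list entry (an operation name, a path, and a value) and the class name are assumptions, since the mechanism is described above only in general terms.

```python
class RecoveringRemote:
    """Hypothetical remote MFM that passes requests through until it is in sync."""

    def __init__(self, metadata):
        self.metadata = dict(metadata)
        self.in_sync = False          # pass ALL requests through while False

    def replay(self, dirty_list):
        """Apply operations recorded by the central site, in order."""
        for op, path, value in dirty_list:
            if op == "set":
                self.metadata[path] = value
            elif op == "delete":
                self.metadata.pop(path, None)
        self.in_sync = True           # the entire dirty list has been replayed

    def route(self, path):
        if self.in_sync and path in self.metadata:
            return "local"
        return "central"


remote = RecoveringRemote({"a.txt": {"size": 1}, "old.txt": {"size": 9}})
before = remote.route("a.txt")
remote.replay([("set", "a.txt", {"size": 2}), ("delete", "old.txt", None)])
after = remote.route("a.txt")
```

Compared with dropping and recreating all mirrors, replaying a dirty list transfers only the changes made while the remote site was down.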

A central concept here is that, if anything happens to the metadata or data at the remote site, the central site contains the authoritative copy of the data, and the MFM's metadata and data can be recreated. Because of this, the MFM at the remote site does not necessarily need to be backed up nor be made highly available, since requests can still be satisfied by the central site.

It should be noted that the above-referenced functionality can be implemented without changing any application code or normal client processes.

It also should be noted that the remote MFMs are generally not required to implement the full functionality of the central MFM and therefore could be implemented as a separate product and/or on a different platform.

Common Global Namespace in Remote File Virtualization

In an exemplary embodiment, file virtualization technology is used to maintain a common global namespace between a central site and many remote sites across the WAN. The namespace exported by a central site is mirrored across all the remote sites. Exemplary embodiments use a transaction log and snapshots of the namespace to facilitate synchronizing the common namespace. Furthermore, the common namespace is maintained by performing the synchronization lazily to reduce the need of common namespace synchronization at the remote site.

In summary, exemplary embodiments may use file virtualization to construct a common global namespace among a central site and remote sites across the WAN. File virtualization decouples the identification of a file or directory from the file's or directory's physical storage location and therefore a namespace can be constructed independent of the underlying file systems. The namespace exported by the central site is mirrored across all remote sites to create a common global namespace. A per-directory transaction log and a namespace snapshot are used at the central site to facilitate synchronizing the common namespace among all sites. Remote sites are responsible for synchronizing the common namespace and this synchronization is done lazily and only when needed. Other techniques are employed to further reduce the need for remote sites to communicate with the central site for the purpose of checking whether the contents of a directory are synchronized.

The storage for the global namespace is constructed from one or more file system partitions exported from file servers located at the central site. This storage is then used for the global namespace itself. Virtual partitions are “carved” out of the namespace such that all of the namespaces contained in each virtual partition are non-overlapping, and the union of all the namespaces contained in each virtual partition is the same as the entire global namespace itself. Thus, as depicted in FIG. 4, each non-overlapping global namespace partition, hereafter referred to as a partition, contains a directory hierarchy consisting of directories, subdirectories, and file objects. Various embodiments allow the authoritative copy of the metadata of one or more of the partitions to reside at a remote site. Therefore, in the example shown in FIG. 4, the Engineering department could be at a remote site, and the metadata for the Engineering partition could have its authoritative copy reside at that remote site while the metadata for the other partitions could have their authoritative copies reside at the central site. In exemplary embodiments, the synchronization authority for a partition resides at the site that owns the partitioned namespace and hosts the authoritative copy of the metadata of the partitioned namespace.

Other sites consult the synchronization authority to determine if their mirror copy of the data or metadata is valid, as well as to request locks.

Each partition has a Table of Partition Transactions or log. An exemplary Table of Partition Transactions is depicted in FIG. 5. This table of partition transactions (300) records all of the transactions that have been performed on any directory in the partition.

Each transaction in a partition is identified by a unique transaction ID (TID). The TID of a partition is a monotonically increasing number starting from 1. The first transaction of a partition has an assigned TID equal to 1. Each subsequently assigned TID is one greater than the previously assigned TID. A TID, once assigned, will not be reassigned or reused.

In addition, the partition also records the Lowest Transaction ID (330), the Highest Transaction ID (340), and a Snap Transaction ID (350) which will be described shortly. Each entry (301) in the partition transaction table (300) consists of a Transaction ID (310) of the transaction, and the Parent Directory (320) in which the transaction was performed.
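The partition-level bookkeeping described above can be sketched as follows. This is a minimal illustration only, not the implementation of any embodiment: the class and field names (PartitionEntry, PartitionTransactionTable, record) are hypothetical, and the use of 0 to mean "none yet" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionEntry:
    tid: int               # Transaction ID (310)
    parent_directory: str  # Parent Directory (320)


@dataclass
class PartitionTransactionTable:
    entries: list = field(default_factory=list)
    highest_tid: int = 0   # Highest Transaction ID (340); 0 = none assigned (assumption)
    snap_tid: int = 0      # Snap Transaction ID (350); 0 = never trimmed (assumption)

    @property
    def lowest_tid(self):
        # Lowest Transaction ID (330): TID of the first entry still in the table.
        return self.entries[0].tid if self.entries else None

    def record(self, parent_directory: str) -> int:
        # TIDs start at 1, grow by exactly one, and are never reused.
        self.highest_tid += 1
        self.entries.append(PartitionEntry(self.highest_tid, parent_directory))
        return self.highest_tid


table = PartitionTransactionTable()
first = table.record("\\Finance")
second = table.record("\\Finance\\Reports")
```

Here the next TID is derived from the highest TID ever assigned rather than from the number of entries, so that trimming entries later cannot cause a TID to be handed out twice.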

Each directory in the directory hierarchy that is in the global namespace contains a Table of Directory Transactions or Log. An exemplary Table of Directory Transactions or Log is depicted in FIG. 6. This Table of Directory Transactions (400) records every transaction (401) that has been performed on that particular directory.

The contents of a Table of Directory Transactions (400) consist of a Transaction ID (410), which identifies the transaction that operated on the directory; Deleted (420), indicating that a directory was subsequently deleted (and this operation may be skipped in certain instances); File or Subdirectory Name (430); Action (440), described below; and Attributes (450), which include all necessary attributes such as access permissions, create and deletion times, etc.

In addition, each directory also records the Highest Child Transaction ID (460), which is the highest transaction ID of files or subdirectories in this directory; the Highest Descendant Transaction ID (470), which is the highest transaction ID of any file or subdirectory in this directory, in any subdirectory of this directory, in any subdirectory of those subdirectories, etc.; My Created Transaction ID (480), which is the transaction ID of when this directory was created; and My Last Transaction ID (490), which is the last transaction ID that was entered into this table. The transaction entry (401) with a Transaction ID (410) equal to My Last Transaction ID (490) may not be currently present in the table. This is because the table's entries (401) may have been trimmed. Trimming will be explained shortly.

An Action (440) will always be one of the following types: create file, create directory, rename file, rename directory, delete file, delete directory, change the size of a file, or change any of the file or directory attributes.

Note: If the source and the destination of a rename operation are NOT in the same directory, the rename is recorded as a delete operation in the source directory and a create operation in the target directory.
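The rename rule in the note above can be illustrated with a short sketch. The function name rename_records and the tuple layout (directory, action, name) are hypothetical and chosen only for illustration; how transaction IDs are assigned to the two resulting records is not addressed here.

```python
import ntpath  # Windows-style path handling, matching the backslash paths used in this document


def rename_records(src: str, dst: str):
    """Return the (directory, action, name) records that a file rename would produce."""
    src_dir, src_name = ntpath.split(src)
    dst_dir, dst_name = ntpath.split(dst)
    if src_dir == dst_dir:
        # Same directory: a single rename record in that directory's table.
        return [(src_dir, "rename file", src_name + " -> " + dst_name)]
    # Different directories: a delete in the source table and a create in the target table.
    return [
        (src_dir, "delete file", src_name),
        (dst_dir, "create file", dst_name),
    ]


same_dir = rename_records(r"\Finance\Models\a.xls", r"\Finance\Models\b.xls")
cross_dir = rename_records(r"\Finance\Reports\a.pdf", r"\Finance\Models\a.pdf")
```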

The Table of Partition Transactions (300) and all of the Tables of Directory Transactions (400) will continue to grow without bound as additional transactions are performed. Therefore, the tables need to be trimmed periodically. Trimming is performed by first mirroring the entire partition directory hierarchy without user data onto a mirror partition. That is, the entire directory tree structure is mirrored, but not the data. All files will become sparse files (sparse files are files that do not occupy any storage) but with the file size set correctly. The mirror between the partition and its mirror partition is then broken at a specific time. The mirror partition now contains a snapshot of the metadata of the original partition at a specific transaction ID, which is referred to as the Snap Transaction ID (350). The mirror partition containing the metadata snapshot is hereafter referred to as a snapshot.

Once the snapshot is created, the Table of Partition Transactions (300) and all of the Tables of Directory Transactions (400) can be trimmed. Trimming means that all of the transaction entries (301 and 401) with a Transaction ID (310 and 410) that is less than or equal to the Snap Transaction ID (350) can be deleted from the tables (300 and 400). As new transactions occur on the partition, they are appended to the Table of Partition Transactions (300) and the appropriate Table of Directory Transactions (400). The snapshot represents the state of the partition at the Snap Transaction ID (350) which should be equal to 1 less than the Lowest Transaction ID (330), since transaction IDs are monotonically increasing by one each time.
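As a minimal sketch of the trimming rule, assume each table is held as a list of (transaction ID, payload) tuples in ascending order; the function name trim and the sample values are illustrative only.

```python
def trim(entries, snap_tid):
    """Keep only entries whose transaction ID is greater than the Snap Transaction ID."""
    return [entry for entry in entries if entry[0] > snap_tid]


# Hypothetical partition table before trimming: (transaction ID, parent directory).
partition_log = [
    (199, "\\Finance"),
    (200, "\\Finance\\Reports"),
    (201, "\\Finance\\Reports"),
]
snap_tid = 200  # the snapshot captures the partition state as of transaction 200

partition_log = trim(partition_log, snap_tid)
lowest_tid = partition_log[0][0]
```

After trimming, the first remaining entry carries the Lowest Transaction ID, which in this example is one greater than the Snap Transaction ID, as the text above states.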

The snapshot mechanism itself is frequently provided by the native file systems used as storage for the global namespace. For example, Microsoft's NTFS provides a snapshot facility through the Volume Shadow Copy Service (VSS). Such native snapshot mechanisms can be used to optimize the mechanism to create a partition snapshot.

The Table of Partition Transactions (300) and the Tables of Directory Transactions (400) are used to facilitate the synchronization of mirrors at remote sites.

Given a Table of Partition Transactions (300) that has not been trimmed, at the remote site one can simply apply all of the transactions in this table to an empty partition, to create a mirror of the current partition's namespace. Once a Table of Partition Transactions (300) has been trimmed, at the remote site one simply needs to first reconstruct the common global namespace by copying the snapshot from the central site to the remote site. Then, starting with the snapshot of the partition at Snap Transaction ID (350), apply all of the entries (301) in the Table of Partition Transactions (300) that have a Transaction ID greater than the Snap Transaction ID (350). The result is a reconstructed common global namespace at the remote site that is a mirror of the central site's current partition namespace.
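The reconstruction just described can be sketched as below. This is a simplified model under stated assumptions: the namespace is represented as a set of path strings, and each transaction is a hypothetical (transaction ID, action, path) tuple that already carries the action and full path, whereas in the tables described above those details are held in the Tables of Directory Transactions (400).

```python
def rebuild_namespace(snapshot_paths, snap_tid, transactions):
    """Start from the snapshot and replay every transaction newer than snap_tid."""
    namespace = set(snapshot_paths)
    for tid, action, path in sorted(transactions):
        if tid <= snap_tid:
            continue  # already reflected in the snapshot
        if action == "create":
            namespace.add(path)
        elif action == "delete":
            namespace.discard(path)
    return namespace


snapshot = {"\\Finance", "\\Finance\\Reports"}
log = [
    (201, "create", "\\Finance\\Reports\\3Q07"),
    (210, "create", "\\Finance\\Reports\\3Q07\\Corp.pdf"),
    (371, "delete", "\\Finance\\Reports\\3Q07\\Corp.pdf"),
]
mirror = rebuild_namespace(snapshot, 200, log)

# Untrimmed case: an empty partition and a Snap Transaction ID of 0 (assumed convention).
from_scratch = rebuild_namespace(set(), 0, [(1, "create", "\\Finance")])
```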

To enable the synchronization and subsequent use of the global namespace at a remote site, a few tables are maintained at the remote site. The first, referred to as the Table of Remote Site Replay Transactions (500), is an augmented version of the Table of Partition Transactions (300) with a new column, Done (520), added. An exemplary Table of Remote Site Replay Transactions (500) is shown in FIG. 7.

Some additional values are associated with the Table of Remote Site Replay Transactions (500). Lowest Transaction ID (540) is the transaction ID of the first entry (501) in the table (500). Since the Table of Remote Site Replay Transactions (500) is an augmented version of the central site's Table of Partition Transactions (300), the remote site's Lowest Transaction ID (540) value will be the same as the central site's Lowest Transaction ID (330) value at the moment the table was copied.

Another value associated with the Table of Remote Site Replay Transactions (500) is the Highest Transaction ID (560). Highest Transaction ID (560) is the transaction ID of the last entry (501) in the table (500). Since the Table of Remote Site Replay Transactions (500) is an augmented version of the central site's Table of Partition Transactions (300), the remote site's Highest Transaction ID (560) value will be the same as the central site's Highest Transaction ID (340) value at the moment the table was copied.

The final value associated with the Table of Remote Site Replay Transactions (500) is the Last Processed Transaction ID (550). This value is persistent for each partition whose namespace is mirrored at the remote site. The Last Processed Transaction ID (550) starts at 0, and gets set to a new value that is the larger of (1) the central site's Snap Transaction ID (350) at the moment the table was copied from the central site and (2) the current Last Processed Transaction ID (550). As transactions are being replayed, the Last Processed Transaction ID (550) is updated such that all entries (501) less than or equal to the Last Processed Transaction ID (550) are marked as Done (520) since those entries (501) have all been processed.
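The handling of the Last Processed Transaction ID (550) can be illustrated with a small sketch. The function names and the dictionary layout of a replay entry are hypothetical.

```python
def on_table_copied(last_processed, central_snap_tid):
    """New Last Processed Transaction ID after copying the table from the central site."""
    return max(last_processed, central_snap_tid)


def mark_done(replay_entries, last_processed):
    """Mark every replay entry at or below the Last Processed Transaction ID as Done."""
    for entry in replay_entries:
        if entry["tid"] <= last_processed:
            entry["done"] = True


fresh_site = on_table_copied(0, 200)     # a new remote site jumps to the snapshot point
caught_up = on_table_copied(250, 200)    # a site already past the snapshot keeps its value

replay = [{"tid": 201, "done": False}, {"tid": 210, "done": False}]
mark_done(replay, 201)
```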

FIG. 8 shows an exemplary single persistent value kept for each directory in the partition on the remote site. The My Last Transaction ID (600) is the value of the last transaction ID that was processed and replayed in this directory.

All of the tables discussed so far are used for synchronizing the remote site's namespace with the central site's authoritative namespace. It is the responsibility of the remote site's MFM to synchronize the contents of its mirror directory with the authoritative copy at the central site. The basic idea is for the remote site to reconstruct the global namespace first from the snapshot and apply the transactions one at a time. When a lookup in the namespace occurs at the remote site, the remote MFM determines if the remote site's global namespace is synchronized enough to satisfy the particular lookup. If not synchronized enough, a synchronization process to synchronize the global namespace at the remote site with the central site is triggered in the background, and the lookup of namespace information is satisfied by using the central site's global namespace, the authoritative copy.

An exemplary algorithm to determine if the remote site's mirror copy of the namespace is synchronized enough is shown in FIG. 9.

The following are the steps to perform a lookup of metadata at the remote site, as shown in FIG. 9:

Step 1 (705): Initialization steps include setting the CurrentPath=the partition that the file of interest is on, as well as setting the FullPath=the full pathname, excluding the filename or last component of the path if the path refers to a directory (for example, the FullPath of \partition\dir1\dir2\filename.txt is \partition\dir1\dir2 while the FullPath of \partition\dir1\dir2 is \partition\dir1). The last component of the pathname (filename.txt and dir2 respectively in the examples) will be either resolved locally or at the central site.

Step 2 (710): Retrieve the four values Highest Child Transaction ID (460), Highest Descendant Transaction ID (470), My Created Transaction ID (480), and My Last Transaction ID (490) from the central site for the directory CurrentPath. The Highest Descendant Transaction ID (470) for the root directory is identical to the Highest Transaction ID (340).

Step 3 (715): Determine if a background synchronization should be performed by comparing the remote site's Highest Transaction ID (560) with the central site's Highest Descendant Transaction ID (470) of the root of the partition. If a synchronization should be performed, actually perform the synchronization in the background at an appropriate time (synchronizations can be set to occur no more frequently than a specified interval, for example).

Step 4 (725): Determine if the remote site's mirror at CurrentPath can be used, or if the authoritative copy at the central site must be used, by comparing the remote site's My Last Transaction ID (600) for the CurrentPath directory with the central site's My Last Transaction ID (490) previously returned. If the remote mirror cannot be used, kick off a synchronization of the mirror (735), satisfy the request with the authoritative copy of the metadata from the central site (740), and exit the process.

Step 5 (750 and 755): At this point, the remote site's mirror can be used. Check if the algorithm is done by checking if the CurrentPath is equal to the full pathname needed (FullPath). If so, exit the process and use the remote site's mirror to look up the last component of the pathname.

Step 6 (765): Set CurrentPath=CurrentPath+the next piece of the path from FullPath.

Step 7 (770 and 775): Determine if the remote site's mirror at CurrentPath can be used, or if the authoritative copy at the central site must be used, by comparing the remote site's My Last Transaction ID (600) for the CurrentPath directory with the central site's Highest Child Transaction ID (460) previously returned. Note that this Highest Child Transaction ID (460) is a property of the Table of Directory Transactions (400) for the parent directory of CurrentPath. Failing this test does not indicate that the remote site mirror cannot be used. Failure merely indicates that one child mirror of the parent is stale. The CurrentPath directory might still be OK, and this needs to be checked. If the test failed, then go to step 10.

Step 8 (780): Check if the algorithm is done by checking if the CurrentPath is equal to the full pathname needed (FullPath). If so, exit the process and use the remote site's mirror.

Step 9 (795): Set CurrentPath=CurrentPath+the next piece of the path from FullPath.

Step 10 (797): Retrieve the four values Highest Child Transaction ID (460), Highest Descendant Transaction ID (470), My Created Transaction ID (480), and My Last Transaction ID (490) from the central site for the directory CurrentPath.
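The core of the lookup walk can be sketched in a deliberately simplified form. The sketch below covers only the per-directory comparison of Step 4, applied at every level of the path. It omits the Highest Child Transaction ID shortcut of Step 7 and the background synchronization of Steps 3 and 4, and it assumes that a mirror directory is usable exactly when the two My Last Transaction ID values are equal, which is an assumption rather than something the steps above specify. The names split_dirs and resolve are hypothetical, and central_last_tid stands in for a round trip to the central site.

```python
def split_dirs(full_path):
    """'\\p\\d1\\d2' -> ['\\p', '\\p\\d1', '\\p\\d1\\d2'] (partition root first)."""
    parts = [part for part in full_path.split("\\") if part]
    return ["\\" + "\\".join(parts[: i + 1]) for i in range(len(parts))]


def resolve(full_path, remote_last_tid, central_last_tid):
    """Decide whether the remote mirror can serve a lookup under full_path.

    remote_last_tid:  dict mapping directory -> remote My Last Transaction ID (600)
    central_last_tid: callable returning the central My Last Transaction ID (490)
    Returns ("remote", None) or ("central", first_stale_directory).
    """
    for directory in split_dirs(full_path):
        if remote_last_tid.get(directory) != central_last_tid(directory):
            # Stale or missing mirror: fall back to the authoritative copy.
            return ("central", directory)
    return ("remote", None)


# Illustrative values only; they are not taken from the figures.
remote = {"\\partition": 40, "\\partition\\dir1": 35}
central = {"\\partition": 40, "\\partition\\dir1": 52}

stale = resolve("\\partition\\dir1", remote, central.get)
fresh = resolve("\\partition", remote, central.get)
```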

In summary, if the remote site's locally mirrored metadata at any particular level of the directory structure can be used, then it is unnecessary to send this particular data from the central site to the remote site. If the remote site's mirrored metadata cannot be used, then a resynchronization is kicked off (735) in the background, and the central site is used to satisfy the metadata requests (740) until the synchronization is completed.

As mentioned earlier, synchronization of the mirror at the remote site with the central site's authoritative copy is the responsibility of the MFM at the remote site. Once synchronization is needed, the exemplary algorithm in FIG. 10 may be used to perform the actual synchronization.

The following are the steps of the synchronization process (800) performed by the remote site's MFM:

Step 1 (805): Get the Table of Partition Transactions from the Central Site. Augment the table to create the Table of Remote Site Replay Transactions (500) by setting the Done column (520) of each entry (501) to “FALSE”.

Step 2 (810): Determine if a synchronization is really needed by checking the Remote Site's Last Processed Transaction ID (550) against the Central Site's Highest Transaction ID (340).

Step 3 (825): Check if the current remote site's metadata is sufficient to work with the Table of Remote Site Replay Transactions (500) by checking to make sure that the last Snap Transaction ID (350) is less than the remote site's Last Processed Transaction ID (550). If not sufficient, continue with Step 4, otherwise go to step 7.

Step 4 (830): Get the central site's snapshot as the base to replay transactions against.

Step 5 (835): Get the Table of Partition Transactions from the Central Site. Augment the table to create the Table of Remote Site Replay Transactions (500) by setting the Done column (520) of each entry (501) to “FALSE”. This is done again to make sure that the latest table (300) has been retrieved, since the table (300) may have changed since the initial retrieval, while the snapshot metadata was retrieved.

Step 6 (840): Set the remote site's Last Processed Transaction ID (550) equal to the central site's Snap Transaction ID (350).

Step 7 (850): The first entry (501) to work with is the first transaction with a Transaction ID (510) greater than Last Processed Transaction ID (550) that also has Done (520)=“FALSE”.

Step 8 (855): If no such entry (501) exists, then the synchronization process is complete, otherwise continue.

Step 9 (865): Retrieve a copy of the central site's Table of Directory Transactions (400) for this entry's (501) Parent Directory (530).

Step 10 (870): For each entry in the remote site's copy of the Table of Directory Transactions (400), replay the transaction in this directory. However, there is no need to replay transactions (401) whose Transaction ID (410) is greater than the Remote Site's Highest Transaction ID (560). This situation may arise since transactions continue to occur, but this algorithm continues to use the previously retrieved Table of Partition Transactions (300). As each transaction is replayed on the remote site, mark the Done (520) value to “TRUE” in the Table of Remote Site Replay Transactions (500). When done, the remote site's copy of the Table of Directory Transactions (400) can be deleted. The value of this directory's My Last Transaction ID (600) is set to the last transaction ID replayed, and persisted.

Step 11 (875): Get the next entry (501) from the Table of Remote Site Replay Transactions (500) that is larger than the remote site's Last Processed Transaction ID (550) that also has Done (520)=“FALSE”. Set Last Processed Transaction ID (550) to the Transaction ID (510) immediately preceding this entry. Continue with Step 8.
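The synchronization steps above can be condensed into the following sketch. It is an illustration under several assumptions, not the algorithm of FIG. 10 itself: both sites are modeled as plain dictionaries with hypothetical keys, the namespace is a set of paths, only create and delete actions are replayed, the second table fetch of Step 5 is omitted, and the Last Processed Transaction ID is advanced once at the end rather than entry by entry as in Step 11.

```python
def synchronize(remote, central):
    """Bring the remote mirror up to the central site's Highest Transaction ID."""
    # Step 1: copy the partition table and add a Done flag to each entry.
    replay = {tid: {"parent": parent, "done": False}
              for tid, parent in central["partition_log"]}
    highest = central["highest_tid"]

    # Step 2: nothing to do if the remote site is already caught up.
    if remote["last_processed"] >= highest:
        return remote

    # Steps 3-6: the remote metadata is sufficient only if the Snap Transaction ID
    # is less than Last Processed; otherwise restart from the central snapshot.
    if not central["snap_tid"] < remote["last_processed"]:
        remote["namespace"] = set(central["snapshot"])
        remote["last_processed"] = central["snap_tid"]

    # Steps 7-11: walk the not-yet-done entries and replay their directories.
    for tid in sorted(replay):
        if replay[tid]["done"] or tid <= remote["last_processed"]:
            continue
        parent = replay[tid]["parent"]
        # Step 9: fetch that directory's table (empty if the directory is gone).
        for dir_tid, action, name in central["dir_logs"].get(parent, []):
            if dir_tid > highest or dir_tid <= remote["last_processed"]:
                continue  # newer than the copied table, or already covered
            if dir_tid not in replay or replay[dir_tid]["done"]:
                continue
            path = parent + "\\" + name
            if action == "create":
                remote["namespace"].add(path)
            elif action == "delete":
                remote["namespace"].discard(path)
            replay[dir_tid]["done"] = True             # Step 10: mark Done
            remote["dir_last_tid"][parent] = dir_tid   # My Last Transaction ID (600)

    remote["last_processed"] = highest
    return remote


central_site = {
    "partition_log": [(201, r"\Finance\Reports"), (210, r"\Finance\Reports\3Q07")],
    "highest_tid": 210,
    "snap_tid": 200,
    "snapshot": {r"\Finance", r"\Finance\Reports"},
    "dir_logs": {
        r"\Finance\Reports": [(201, "create", "3Q07")],
        r"\Finance\Reports\3Q07": [(210, "create", "Corp.pdf")],
    },
}
remote_site = {"namespace": set(), "last_processed": 0, "dir_last_tid": {}}
synchronize(remote_site, central_site)
```

Calling synchronize a second time with unchanged central state returns immediately at Step 2, since the Last Processed Transaction ID already equals the Highest Transaction ID.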

In summary, it is the responsibility of the remote MFM to actually perform the synchronization. If a mirror is not available, the needed metadata is always available at the central site where the authoritative copy exists.

EXAMPLE

The following is an example of how the Table of Partition Transactions (300) and the Tables of Directory Transactions (400) are maintained as files and directories are added to and deleted from a partition.

Shown in FIG. 11 is a sample starting state of a partition. A directory structure is shown (1300), consisting of three partitions (Finance, Marketing, and Engineering). The Finance partition consists of 2 subdirectories: Reports (1301) and Models (1304). The Reports (1301) subdirectory has the further subdirectory 3Q07 (1302). In 3Q07 (1302) is a single file, Corp.pdf (1303). In the Models directory (1304) is a single file, EngFinance.xls (1305).

The Table of Partition Transactions for \Finance is shown (1310). The table has been previously trimmed since Snap TID (1329) is a value larger than 0. A number of transactions are in the Table of Partition Transactions (1310) and the Tables of Directory Transactions (1330, 1340, 1350, and 1360). The transactions which are of interest are the creation of the directory \Finance\Reports\3Q07 as Transaction ID 201 (1311), the creation of the file \Finance\Reports\3Q07\Corp.pdf as Transaction ID 210 (1312), and the creation of the file \Finance\Models\EngFinance.xls as Transaction ID 227 (1313).

The first operation that will be performed to transition from FIG. 11 to FIG. 12 is to delete the file \Finance\Reports\3Q07\Corp.pdf (1303). This causes a new Transaction ID 371 to be entered into the Table of Partition Transactions (1414), as well as the Table of Directory Transactions for \Finance\Reports\3Q07 (1452). It should be noted that Transaction ID 210 (1451) is also modified to change the Deleted flag to “True”, since the file is now deleted, and the Transaction 210 can be safely skipped in some instances.

Additionally, other values are updated appropriately, based upon the current state of the partition (1428, 1437, 1446, 1447, 1456, 1457, and 1459).

To transition from FIG. 12 to FIG. 13, the directory \Finance\Reports\3Q07 (1402) is deleted (since the directory is now empty, this is possible). This results in a new transaction, Transaction ID 413, which is entered in the Table of Partition Transactions for \Finance (1515) and the Table of Directory Transactions for the directory \Finance\Reports (1542). Also, the Transaction 201 (1541) is modified to indicate that the directory was subsequently deleted (this transaction does not need to be replayed in some instances). The Table of Directory Transactions for the directory \Finance\Reports\3Q07 (1450) is removed in the transition to FIG. 13.

In addition, other values are updated appropriately, based upon the current state of the partition (1528, 1536, 1537, 1546, 1547, and 1549).

At some later point in time, the directory \Finance\Reports\3Q07 (1602) is created again in transitioning from FIG. 13 to FIG. 14. It should be noted that this is an entirely new directory. The previous 3Q07 directory was deleted. A new Table of Directory Transactions (1650) is created for 3Q07 (and note that My Created Transaction ID (1658) for 3Q07 is set with the appropriate Transaction ID). Also, a Transaction ID 550 is entered in the Table of Partition Transactions (1616) and the Table of Directory Transactions for directory \Finance\Reports (1643).

In addition, other values are updated appropriately, based upon the current state of the partition (1628, 1636, 1637, 1646, 1647, and 1658).

As the final step of this example, the file \Finance\Reports\3Q07\Corp2.pdf (1703) is created in the transition from FIG. 14 to FIG. 15. This results in the Transaction ID 555 being added to the Table of Partition Transactions (1717) as well as the Table of Directory Transactions for \Finance\Reports\3Q07 (1751).

In addition, other entries are updated accordingly, based upon the current state of the partition (1728, 1737, 1746, 1747, 1756, and 1757).

Authoritative Copy Maintained at Remote Site

As discussed above, a mirror server at a remote site may contain the current, past, or both current and past mirror copies of the authoritative copy of files stored at the central site. No particular directory structure is assumed.

In conjunction with the mirror server, the traditional host server at the remote site may be replaced with a Solid State Disk (SSD) or other equally highly reliable storage device at the remote site, which can be accessed directly from the MFM. The SSD or other highly reliable storage device, because of this high reliability, does not require backup and maintains its data even in the event of catastrophic failure. The use of an SSD or other highly reliable storage device can do away with the need for a managed edge server at the remote sites.

Traditionally, edge servers provide the ability for a remote site to still be able to access and modify data on the edge server, even if the WAN connection is unavailable. Removal of edge servers (by moving the data to the central site) has traditionally meant that the ability for a remote site to access and modify data is unavailable when the WAN connection is unavailable. Certain embodiments remove the need for the managed edge server at the remote site, while still preserving the ability to access and modify data when the WAN is unavailable. An exemplary system is shown in FIG. 16.

As discussed above with reference to FIG. 4, the global namespace is “carved” into non-overlapping virtual partitions such that all of the namespaces contained in each virtual partition are non-overlapping, and the union of all the namespaces contained in each virtual partition is the same as the entire global namespace itself. Thus, each non-overlapping global namespace partition, hereafter referred to as a partition, contains a directory hierarchy consisting of directories, subdirectories, and file objects.

In exemplary embodiments, the authoritative copy of metadata of one or more of the partitions may reside at a remote site. Thus, for example, with reference again to FIG. 4, the Engineering department could be at a remote site, and the metadata for the Engineering partition could have its authoritative copy reside at that remote site.

In addition, the synchronization authority for a partition resides at the site that owns the partitioned namespace and hosts the authoritative copy of the metadata of the partitioned namespace. As discussed above, other sites consult the synchronization authority to determine if their mirror copy of the data or metadata is valid, as well as to request locks.

In exemplary embodiments, when the synchronization authority resides at a remote site, the central site is responsible for synchronizing a partitioned namespace with the remote site that is the synchronization authority of the partitioned namespace. All other remote sites will continue to synchronize the partitioned namespace with the central site. Other than the central site and the remote site that is the synchronization authority of a partitioned namespace, all other remote sites are not aware that the central site is not the synchronization authority of a partitioned namespace.

Thus, in an exemplary embodiment, it is the responsibility of the central site to synchronize its mirror copy of the partitioned namespace with the authoritative copy of the partitioned namespace owned by a remote site. The steps to synchronize a common namespace between two sites are described generally above.

Access to files in a partitioned namespace controlled by a remote site from other remote sites will continue to go through the central site. Some data read operations may be satisfied by a mirror copy at the central site, but most other operations, such as locking, writes, or updating attributes, will go through two hops. First, the other remote sites send their request to the central site. Then the central site determines if the actual synchronization authority resides at some other remote site, and forwards the request to the remote site that actually owns the authoritative copy of the data or metadata.
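The two-hop path can be illustrated with a small routing sketch for a request issued by a remote site that does not own the partition. The function name route_request, the site labels, and the authority_of mapping are hypothetical.

```python
def route_request(partition, operation, authority_of, central_has_mirror=False):
    """Return the ordered list of sites visited by a request from a non-owning remote site."""
    hops = ["central"]  # first hop: every such request goes to the central site
    authority = authority_of.get(partition, "central")
    if authority == "central":
        return hops
    if operation == "read" and central_has_mirror:
        return hops  # a read may be served from the central site's mirror copy
    # Locks, writes, and attribute updates are forwarded to the owning remote site.
    return hops + [authority]


authority_of = {"Engineering": "remote-A"}  # illustrative mapping only

write_path = route_request("Engineering", "write", authority_of)
read_path = route_request("Engineering", "read", authority_of, central_has_mirror=True)
finance_path = route_request("Finance", "write", authority_of)
```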

In certain embodiments, the SSD or other highly-reliable storage device is used to reliably store the authoritative copy of files that are in the namespace owned by a remote site. However, the metadata of the namespace owned by the remote site is stored in the MFM at the remote site. Each file stored in the SSD or other highly-reliable storage device is identified by a 128-bit globally unique file ID. However, generally speaking, not all authoritative copies in the namespace owned by the remote site are in the SSD or other highly-reliable storage device. This is because the SSD or other highly-reliable storage device is generally a relatively expensive device. To allow additional files to be stored in the SSD or other highly-reliable storage device, files in the SSD or other highly-reliable storage device may be purged to reclaim space. Prior to the file being deleted within the SSD or other highly-reliable storage device, the data is copied to the central site. Furthermore, to enable the remote site to operate even if the WAN is unavailable, the data is typically also copied to the mirror server. Only then is the file deleted from the SSD or other highly-reliable storage device.

Before a write operation is allowed to update a file in the partitioned namespace owned by a remote site, the file to be updated must exist in the SSD or other highly-reliable storage device. If the file is not already present in the SSD or other highly-reliable storage device, the file must be copied to the SSD or other highly-reliable storage device from either a mirror copy stored locally at the remote site or from the mirror copy at the central site. Only after a file is stored in the SSD or other highly-reliable storage device is the file allowed to be written at will.
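The precondition for writes can be sketched as follows. For simplicity, all three stores are modeled as dictionaries keyed by the same hypothetical file identifier, whereas the mirror server described elsewhere in this document is indexed by sha1 digest; the function name prepare_for_write is likewise hypothetical.

```python
def prepare_for_write(file_id, ssd, local_mirror, central_mirror):
    """Ensure the file is in the SSD before a write; return where its contents came from."""
    if file_id in ssd:
        return "ssd"
    if file_id in local_mirror:
        ssd[file_id] = local_mirror[file_id]   # prefer the copy at the remote site
        return "local mirror"
    ssd[file_id] = central_mirror[file_id]     # otherwise fetch the central mirror copy
    return "central mirror"


ssd = {"f1": b"one"}
local_mirror = {"f2": b"two"}
central_mirror = {"f2": b"two", "f3": b"three"}

sources = [prepare_for_write(f, ssd, local_mirror, central_mirror)
           for f in ("f1", "f2", "f3")]
```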

After some period of time (either a timeout, or there is a need to recover space within the SSD or other highly-reliable storage device, or the file is no longer being written), the MFM copies the data to the central site (thus a mirror copy of the data exists at the central site). This allows the central site to back up the data, instead of having to perform backups at the remote site.

The SSD or other highly-reliable storage device is unmanaged storage not requiring backup and restore type administration due to its very nature of being highly reliable. All management of the SSD or other highly-reliable storage device (including copying into or out of the SSD or other highly-reliable storage device, or deleting files within the SSD or other highly-reliable storage device) is performed by the MFM.

In an exemplary embodiment, as described above, the mirror server is indexed by sha1 digest values to retrieve file contents. The mirror server continues to be unmanaged storage not requiring backup and restore type administration. Since unmanaged storage is relatively inexpensive, the mirror server at the remote site should be much bigger than the SSD or other highly-reliable storage device. As discussed above, mirror copies may be purged from the mirror server at any time. If purged, data for a given file will always be available in either the SSD or other highly-reliable storage device, or at the central site.

Management of the mirror server (including adding mirror copies and deleting mirror copies to free up space) is performed by the MFM. If the mirror copy of a file purged from the SSD or other highly-reliable storage device is not removed from the mirror server, the file may be brought back into the SSD or other highly-reliable storage device from the mirror server on a subsequent write. If the mirror copy of a file purged from the SSD or other highly-reliable storage device has been removed from the mirror server, then the file's contents would need to be obtained from the central site. With a simple least recently used algorithm for purging data contents from the mirror server together with a huge mirror server, retrieval of the “backup” mirror copy from the central site should generally not be necessary. In addition, a clean up process could be run periodically to remove past mirror copies from the mirror server. Thus, this design allows the remote site to continue to operate on the majority of the local namespace partition even if the WAN link is down.

When a client at a remote site opens a file stored at the remote site, the open request is actually sent to the local MFM. If the authoritative copy of the open file is located at the central site, the steps as depicted in the co-application, Remote File Virtualization Data Mirroring, will be followed. Otherwise, the authoritative copy of the open file is located locally. In this latter case, the process to open such a local file is as follows:

Open the file locally. If the open is not successful, an error code is returned. The file handle from opening the file locally is called the local file handle. In an exemplary embodiment, the local file is actually a sparse file and does not contain any data.

If the open of the local file is successful, the local file handle is returned to the user. At the same time, the GUID of the file is retrieved from the metadata. The GUID is used to open the authoritative copy of the file stored in the SSD. If this open is successful, the returned file handle, the ssd_file_handle, is associated with the local_file_handle.

If the open by GUID fails, and the open is for read, then the sha1 digest for the file is retrieved from the metadata and is used to obtain a mirror file handle from the mirror server. If a mirror file handle is returned, the mirror file handle is associated with the local_file_handle and the open is done.

Otherwise, the file is marked as not ready. A background process is used to bring in a copy of the file from either the mirror server or the mirror copy located at the central site. The open operation is complete.
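The open steps above can be sketched as follows. This is an illustrative model only, assuming that the metadata, the SSD, and the mirror server are represented by simple in-memory collections; the names LocalHandle, open_local, and copy_queue are hypothetical and do not come from the described embodiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalHandle:
    path: str
    ssd_handle: Optional[str] = None     # handle on the authoritative SSD copy
    mirror_handle: Optional[str] = None  # handle on a mirror-server copy
    not_ready: bool = False


def open_local(path, mode, metadata, ssd, mirror, copy_queue):
    """Open a file whose authoritative copy is located at the remote site.

    metadata:   dict mapping path -> {"guid": ..., "sha1": ...}
    ssd:        set of GUIDs currently present in the SSD
    mirror:     set of sha1 digests currently present on the mirror server
    copy_queue: list standing in for the background copy process
    """
    if path not in metadata:
        raise FileNotFoundError(path)    # the local open failed
    handle = LocalHandle(path)
    entry = metadata[path]
    if entry["guid"] in ssd:
        # Open by GUID succeeded: associate the SSD handle.
        handle.ssd_handle = "ssd:" + entry["guid"]
    elif mode == "r" and entry["sha1"] in mirror:
        # Open by GUID failed, open is for read: fall back to the mirror.
        handle.mirror_handle = "mirror:" + entry["sha1"]
    else:
        # Neither copy is usable yet: mark not ready and schedule a copy.
        handle.not_ready = True
        copy_queue.append(path)
    return handle


metadata = {
    "/a": {"guid": "g1", "sha1": "d1"},
    "/b": {"guid": "g2", "sha1": "d2"},
    "/c": {"guid": "g3", "sha1": "d3"},
}
queue = []
h_a = open_local("/a", "r", metadata, {"g1"}, {"d2"}, queue)  # found in the SSD
h_b = open_local("/b", "r", metadata, {"g1"}, {"d2"}, queue)  # read via the mirror
h_w = open_local("/b", "w", metadata, {"g1"}, {"d2"}, queue)  # write, no SSD copy
h_c = open_local("/c", "r", metadata, {"g1"}, {"d2"}, queue)  # in neither store
```

Note that the open itself returns in all three successful cases; only later requests on a not ready handle have to wait for the background copy.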

When a file request is sent to the MFM, it includes a file handle (the local or the auth file handle). If it is an auth file handle, then the steps described above with reference to remote file virtualization data mirroring will be followed. Otherwise, the steps for handling a file identified by the local file handle are as follows:

If the local_file_handle is marked as not ready, the request will be suspended until the local_file_handle is ready (i.e., the file to be opened is copied into the SSD or other highly-reliable storage device).

If the request is a read operation and if the GUID file handle exists, the GUID file handle is used to retrieve the data. Otherwise, if the GUID file handle does not exist, the mirror handle is used to retrieve the data from the mirror server. The result from either the SSD (or other highly-reliable storage device) or the mirror server is returned to the user.

If the request is a write operation, the GUID file handle is used to write the data to the SSD or other highly-reliable storage device.

If the request is an ioctl call sent from the background copy process informing that the file has been copied into the SSD or other highly-reliable storage device, then the GUID of the file is obtained from the metadata and is used to obtain a GUID file handle from the SSD. After the GUID file handle is obtained, the not ready flag for the file is cleared, and those waiting for the not ready flag to be cleared will be woken up and their operations resumed.

Otherwise, all operations are sent to the MFM and processed locally.
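The request handling steps above can be sketched as follows, assuming a single-threaded model in which suspended requests are queued and replayed when the ioctl arrives. The class LocalFile, the tuple-based request format, and the "ioctl_copied" operation name are hypothetical; a real implementation would block and wake threads rather than queue tuples.

```python
class LocalFile:
    """Hypothetical per-handle state for a file opened through the local MFM."""

    def __init__(self, guid, ssd, mirror_data=None):
        self.guid = guid
        self.ssd = ssd                    # dict: GUID -> bytearray (stands in for the SSD)
        self.mirror_data = mirror_data    # bytes held by the mirror server, or None
        self.has_guid_handle = guid in ssd
        self.not_ready = not self.has_guid_handle and mirror_data is None
        self.suspended = []               # requests waiting on the not ready flag

    def handle(self, request):
        op = request[0]
        if op == "ioctl_copied":
            # Background copy finished: obtain the GUID handle, clear the
            # not ready flag, and resume every suspended request in order.
            self.has_guid_handle = True
            self.not_ready = False
            waiting, self.suspended = self.suspended, []
            return [self.handle(r) for r in waiting]
        if self.not_ready:
            self.suspended.append(request)
            return None
        if op == "read":
            _, offset, length = request
            source = self.ssd[self.guid] if self.has_guid_handle else self.mirror_data
            return bytes(source[offset:offset + length])
        if op == "write":
            if not self.has_guid_handle:
                # Sketch assumption: a mirror handle is only obtained for reads.
                raise PermissionError("no GUID file handle for write")
            _, offset, data = request
            self.ssd[self.guid][offset:offset + len(data)] = data
            return len(data)
        raise ValueError("unsupported operation: " + op)


ssd = {}
f = LocalFile("g1", ssd)                        # in neither store: not ready
pending = f.handle(("read", 0, 4))              # suspended, nothing returned yet
ssd["g1"] = bytearray(b"hello world")           # background copy lands in the SSD
resumed = f.handle(("ioctl_copied",))           # wakes the suspended read
written = f.handle(("write", 0, b"J"))          # writes go to the SSD copy
```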

WAN Optimization Appliances

As discussed above, WAN Optimization Appliances are sometimes used in WAN environments in order to accelerate remote file access. FIG. 17 shows an exemplary switched file system in which WAN Optimization Appliances (i.e., the two boxes labeled “A”) are interposed between the remote file switch and the central file switch. An example of a WAN Optimization Appliance is the STEELHEAD™ appliance sold by Riverbed Technologies Inc., which claims to speed up the TCP traffic between a central site and a remote site to provide 5 to 50 times, and in some cases 100 times, better performance. Such appliances achieve this performance boost by reducing, if possible, the size of each TCP message sent between the remote and the central site (i.e., a form of data compression) and/or pre-sending messages from a remote site to the central site or vice versa (sometimes referred to herein as “spoofing” or “pre-fetching”).

To reduce the size of a message sent, the message is decomposed into a number of variable length fragments. A “fingerprint” is then taken for each fragment. If the receiving site already has a fragment that matches the fingerprint, that fragment will not be sent. The appliances at the central and remote sites are responsible for breaking up a TCP message into fragments and re-assembling it at the other end of the link. The user application is completely unaware of this.
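A minimal sketch of this fragment-and-fingerprint scheme follows. It is not the algorithm of any particular appliance: the boundary rule in split_fragments, the use of sha1 as the fingerprint, and the ("raw" / "ref") wire representation are all assumptions chosen for illustration.

```python
import hashlib


def split_fragments(message, mask=0x3F, min_len=4):
    """Split bytes into variable-length fragments at content-defined boundaries."""
    fragments, start = [], 0
    for i, byte in enumerate(message):
        # Hypothetical boundary rule: cut after a byte whose masked bits are zero.
        if i + 1 - start >= min_len and (byte & mask) == 0:
            fragments.append(message[start:i + 1])
            start = i + 1
    if start < len(message):
        fragments.append(message[start:])
    return fragments


def fingerprint(fragment):
    return hashlib.sha1(fragment).digest()


def encode(message, peer_fingerprints):
    """Sending side: replace fragments the peer already holds with fingerprints."""
    wire = []
    for fragment in split_fragments(message):
        fp = fingerprint(fragment)
        if fp in peer_fingerprints:
            wire.append(("ref", fp))         # peer has it: send the fingerprint only
        else:
            wire.append(("raw", fragment))   # peer lacks it: send the fragment itself
            peer_fingerprints.add(fp)
    return wire


def decode(wire, store):
    """Receiving side: re-assemble the message and remember new fragments."""
    parts = []
    for kind, payload in wire:
        if kind == "raw":
            store[fingerprint(payload)] = payload
            parts.append(payload)
        else:
            parts.append(store[payload])
    return b"".join(parts)


message = bytes(range(256)) * 4
known_to_peer, peer_store = set(), {}
first = encode(message, known_to_peer)
first_decoded = decode(first, peer_store)
second = encode(message, known_to_peer)      # same message sent again
second_decoded = decode(second, peer_store)
```

On the second transmission every fragment is already known to the receiving side, so only fingerprints cross the link, yet the message is re-assembled unchanged.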

With spoofing, the appliance predicts (e.g., from peeking at the reply of a CIFS message that responds to a CIFS read message) a set of messages that is likely to be generated by the receiving site after the CIFS read reply message is received. The appliance then creates a set of messages containing additional CIFS read requests on the same file but with different file offsets, and sends these messages immediately back without waiting for the actual requests. The fragments from the resulting reply messages are then kept by the appliances for future use. This technique increases the likelihood that a fragment that is likely to be requested will already be in the appliance and is somewhat analogous to “pre-fetching” techniques used by file systems to increase sequential read performance. For example, if a user reads 16K bytes of data at offset 0, the file system may immediately issue a read of another 16K bytes of data at offset 16K, in anticipation that the user will likely issue the next read call during a sequential read operation.
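The 16K byte sequential read example can be sketched as follows. The class PrefetchingReader and the fetch callable are hypothetical; the read-ahead is performed synchronously here for simplicity, whereas a real file system or appliance would typically issue it in the background.

```python
class PrefetchingReader:
    """Serves block reads and fetches one block ahead of the caller."""

    def __init__(self, fetch, block_size=16 * 1024):
        self.fetch = fetch              # callable(offset, length) -> bytes (slow path)
        self.block_size = block_size
        self.cache = {}                 # offset -> pre-fetched bytes

    def read(self, offset):
        data = self.cache.pop(offset, None)
        if data is None:
            data = self.fetch(offset, self.block_size)
        # Anticipate the next sequential read and fetch it now.
        next_offset = offset + self.block_size
        if data and next_offset not in self.cache:
            self.cache[next_offset] = self.fetch(next_offset, self.block_size)
        return data


backing = bytes(i % 251 for i in range(40 * 1024))
fetched_offsets = []


def fetch(offset, length):
    fetched_offsets.append(offset)
    return backing[offset:offset + length]


reader = PrefetchingReader(fetch)
block0 = reader.read(0)          # fetches offset 0, then pre-fetches offset 16K
block1 = reader.read(16 * 1024)  # served from the pre-fetched copy
```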

WAN Optimization Appliances of the types just described have certain limitations. For example, while reducing the length of a message between a remote and a central site can certainly speed up the traffic, it would be better to not have to send a message in the first place. The caching/mirroring of metadata and data discussed above with reference to the switched file system shown in FIG. 2 can eliminate some WAN traffic altogether. Also, spoofing in the manner discussed above may not be possible or practical in file systems that employ SMB signing or other client/server authentication mechanisms between the clients and file servers. The main purpose of SMB signing is to prevent injection of CIFS messages between a client and a file server (i.e., to specifically prevent the type of spoofing just described). Currently, SMB signing is not enabled by default in a file server. However, if a file server machine is also used as a domain controller, then SMB signing is automatically enabled by default. If SMB signing is enabled for all file servers (which may become the default setting for CIFS file servers), the appliance will not be able to use the above-mentioned message pre-sending technique to boost performance because it will not be able to generate the proper SMB signatures for the spoofed messages.

File Switch with WAN Optimization

In alternative embodiments of the present invention, WAN optimization functionality of the types described above (including data compression and/or spoofing) may be integrated into the MFM devices. FIG. 18 shows a file switched system having two file switches with WAN optimization functionality (represented by the box with letter “A” in each file switch) in accordance with an exemplary embodiment of the present invention. As discussed above, the MFM uses caching/mirroring of both data and metadata in order to eliminate some WAN traffic. When WAN communication is used between two MFMs, the MFMs could employ data compression to reduce the size of WAN messages.

An MFM could also employ pre-fetching in order to pre-fetch data and/or metadata from another MFM or from a file server. It should be noted that, since the MFM already operates as a true client vis-à-vis the file servers, file requests generated by the MFM (including spoofed messages) could be properly SMB signed so as to operate with file servers that require SMB signing.
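The reason a true client can sign its own pre-fetch requests, while an interposed device cannot, can be illustrated with a generic keyed-digest sketch. This example does not reproduce the actual SMB signing algorithm or message layout; the use of HMAC-SHA256, the sequence number encoding, and the names sign and server_accepts are assumptions made only to show the principle that a valid signature requires the session key shared between client and server.

```python
import hashlib
import hmac


def sign(session_key, sequence, payload):
    """Return a keyed digest over the sequence number and the payload."""
    data = sequence.to_bytes(8, "little") + payload
    return hmac.new(session_key, data, hashlib.sha256).digest()


def server_accepts(session_key, sequence, payload, signature):
    """Server-side check: recompute the digest and compare in constant time."""
    expected = sign(session_key, sequence, payload)
    return hmac.compare_digest(expected, signature)


session_key = b"key-shared-by-client-and-server"   # illustrative value only
request = b"READ offset=16384 length=16384"        # illustrative pre-fetch request

# A device acting as the true client holds the session key and can sign.
client_signature = sign(session_key, 7, request)
client_ok = server_accepts(session_key, 7, request, client_signature)

# An interposed device without the session key cannot produce a valid signature.
forged_signature = sign(b"not-the-session-key", 7, request)
forged_ok = server_accepts(session_key, 7, request, forged_signature)
```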

WAN Optimization Appliance with SMB Signing

In additional embodiments of the present invention, some MFM-type functionality (e.g., spoofing with SMB signing) could be incorporated into a WAN Optimization Appliance. FIG. 19 shows an exemplary system including two WAN Optimization Appliances with SMB signing functionality (represented by the two boxes labeled “A+”). Here the appliance would take on the role of a client for communication with the file servers and would implement SMB signing. Messages sent by the appliance to the file servers (including spoofed messages) could then be properly SMB signed. Such appliances could be used with or without MFMs.

Additional WAN Optimization Functionality for Remote File Virtualization

In additional embodiments, the separate appliances shown in FIG. 17 and/or the MFMs with embedded WAN optimization as shown in FIG. 18 could provide a broadcast service for delivering mirror break messages reliably and with priority from the central site to the remote sites.

Additionally, or alternatively, the separate appliances shown in FIG. 17 and/or the MFMs with embedded WAN optimization as shown in FIG. 18 could provide an efficient file transfer service for pre-positioning files from a central site to the remote sites. For example, optimal fingerprints can be obtained from a set of files to be pre-positioned and these fingerprints could be pre-positioned to all remote sites. Also, optimal fingerprints could be obtained from all file objects in the global namespace for fingerprint preloading at remote sites.

It should be noted that terms such as “client” and “server” are used herein to describe various communication devices that may be used in a communication system, and should not be construed to limit the present invention to any particular communication device type. Thus, a communication device may include, without limitation, a bridge, router, bridge-router (brouter), switch, node, server, computer, or other communication device.

The present invention may be embodied in many different forms, including, but in no way limited to, computer program logic for use with a processor (e.g., a microprocessor, microcontroller, digital signal processor, or general purpose computer), programmable logic for use with a programmable logic device (e.g., a Field Programmable Gate Array (FPGA) or other PLD), discrete components, integrated circuitry (e.g., an Application Specific Integrated Circuit (ASIC)), or any other means including any combination thereof. In a typical embodiment of the present invention, predominantly all of the NFM logic is implemented as a set of computer program instructions that is converted into a computer executable form, stored as such in a computer readable medium, and executed by a microprocessor within the NFM under the control of an operating system.

Computer program logic implementing all or part of the functionality previously described herein may be embodied in various forms, including, but in no way limited to, a source code form, a computer executable form, and various intermediate forms (e.g., forms generated by an assembler, compiler, linker, or locator). Source code may include a series of computer program instructions implemented in any of various programming languages (e.g., an object code, an assembly language, or a high-level language such as Fortran, C, C++, JAVA, or HTML) for use with various operating systems or operating environments. The source code may define and use various data structures and communication messages. The source code may be in a computer executable form (e.g., via an interpreter), or the source code may be converted (e.g., via a translator, assembler, or compiler) into a computer executable form.

The computer program may be fixed in any form (e.g., source code form, computer executable form, or an intermediate form) either permanently or transitorily in a tangible storage medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical memory device (e.g., a CD-ROM), a PC card (e.g., PCMCIA card), or other memory device. The computer program may be fixed in any form in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies (e.g., Bluetooth), networking technologies, and internetworking technologies. The computer program may be distributed in any form as a removable storage medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the communication system (e.g., the Internet or World Wide Web).

Hardware logic (including programmable logic for use with a programmable logic device) implementing all or part of the functionality previously described herein may be designed using traditional manual methods, or may be designed, captured, simulated, or documented electronically using various tools, such as Computer Aided Design (CAD), a hardware description language (e.g., VHDL or AHDL), or a PLD programming language (e.g., PALASM, ABEL, or CUPL).

Programmable logic may be fixed either permanently or transitorily in a tangible storage medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical memory device (e.g., a CD-ROM), or other memory device. The programmable logic may be fixed in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies (e.g., Bluetooth), networking technologies, and internetworking technologies. The programmable logic may be distributed as a removable storage medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the communication system (e.g., the Internet or World Wide Web).

It should be noted that the section headings used throughout the detailed description above are for convenience only and do not limit the present invention in any way.

The present invention may be embodied in other specific forms without departing from the true scope of the invention. The described embodiments are to be considered in all respects only as illustrative and not restrictive.

1. A switched file system comprising: a central network file manager; and at least one remote network file manager in communication coupled to the central network file manager via a communication network, wherein the central network file manager manages reference copies of data and metadata and wherein the remote network file managers maintain mirrored copies of data and metadata for use in servicing client requests without having to communicate with the central network file manager.
2. A switched file system according to claim 1, wherein the central network file manager and the at least one remote network file manager maintain a common global namespace.
3. A switched file system according to claim 1, wherein metadata is mirrored from the central network file manager to the at least one remote network file manager using a lazy mirroring technique.
4. A switched file system according to claim 3, wherein the central network file manager pushes metadata to the at least one remote network file manager.
5. A switched file system according to claim 4, wherein, after pushing metadata to a remote network file manager, the central network file manager verifies that the metadata has not changed since being pushed and notifies the remote network file manager that the metadata is valid.
6. A switched file system according to claim 4, wherein the central network file manager maintains statistics regarding access patterns by remote clients and pushes the metadata to the at least one remote network file manager based on the statistics.
7. A switched file system according to claim 3, wherein a remote network file manager pulls metadata from the central network file manager.
8. A switched file system according to claim 7, wherein, after receiving metadata from the central network file manager, the remote network file manager requests confirmation from the central network file manager that the metadata is still valid.
9. A switched file system according to claim 7, wherein the remote network file manager maintains statistics regarding access patterns by clients and pulls the metadata from the central network file manager based on the statistics.
10. A switched file system according to claim 3, wherein the metadata is mirrored in a breadth-first fashion.
11. A switched file system according to claim 3, wherein the metadata is mirrored in a depth-first fashion.
12. A switched file system according to claim 3, wherein, when metadata is updated at a remote network file manager, the remote network file manager communicates the updated metadata to the central network file manager, and the central network file manager notifies the remote network file managers that the remote site metadata is unsynchronized so that the remote network file managers do not use the unsynchronized metadata.
13. A switched file system according to claim 1, wherein data is mirrored from the central network file manager to the at least one remote network file manager using a lazy mirroring technique.
14. A switched file system according to claim 13, wherein, when a file is updated at a remote network file manager, the remote network file manager communicates the updated data to the central network file manager, and the central network file manager notifies the remote network file managers that the remote site data is unsynchronized so that the remote network file managers do not use the unsynchronized data.
15. A switched file system according to claim 13, wherein at least one of the central network file manager and the remote network file managers maintain statistics regarding client accesses, and wherein the data for such data mirroring is selected based on the statistics.
16. A switched file system according to claim 1, wherein the remote network file managers pass oplock requests from client devices through to the central network file manager.
17. A switched file system according to claim 1, wherein the remote network file managers handle oplock breaks and pass oplock breaks through to the client devices.
18. A switched file system according to claim 1, wherein the remote network file managers flush cached contents back to the central network file manager, and wherein the central network file manager notifies all remote network file managers to break file mirrors for the file.
19. A switched file system according to claim 1, wherein the data and metadata is copied from the central network file manager to the at least one remote network file manager according to a set of rules.
20. A switched file system according to claim 1, wherein the remote network file manager disallows access to mirrored copies of data and metadata when the remote network file manager is unable to communicate with the central network file manager over the communication network.
21. A switched file system according to claim 1, wherein the remote network file manager disallows modification of mirrored copies of data and metadata when the remote network file manager is unable to communicate with the central network file manager over the communication network.
22. A network file manager that operates as a client to file server nodes and as a server to client nodes and interacts with both the client nodes and the file server nodes using the standard network file protocols, wherein the network file manager implements SMB signing on communications with the file server nodes including SMB signing on messages used to pre-fetch data from the file server nodes.
23. A network file manager according to claim 22, wherein the network file manager further implements data compression on communications with the file server nodes.
24. A WAN optimization appliance that operates as a client to file server nodes, wherein the appliance implements SMB signing on communications with the file server nodes including SMB signing on messages used to pre-fetch data from the file server nodes.
25. A WAN optimization appliance according to claim 24, wherein the appliance further implements data compression on communications with the file server nodes.
26. A WAN optimization appliance comprising a broadcast service for delivering mirror break messages reliably and in priority from the central site to the remote sites.
27. A WAN optimization appliance comprising a file transfer service for pre-positioning files from a central site to a number of remote sites.
28. A WAN optimization appliance according to claim 27, wherein the appliance obtains optimal fingerprints from a set of files to be pre-positioned and pre-positions these fingerprints to remote devices.
29. A WAN optimization appliance according to claim 27, wherein the appliance obtains fingerprints from file objects in a global namespace for fingerprint preloading at remote sites.